SPT-Swin: A Shifted Patch Tokenization Swin Transformer for Image Classification

Gazi Jannatul Ferdous, Khaleda Akhter Sathi, Md Azad Hossain, M. Ali Akber Dewan

Research output: Contribution to journal › Journal Article › peer-review

Abstract

Recently, transformer-based models such as the vision transformer (ViT) have been used extensively in computer vision tasks. However, the superior performance of ViT comes at the cost of requiring an enormous training dataset, and the complexity of calculating self-attention between patches is quadratic in the number of patches. To address these two concerns, this paper proposes a novel shifted patch tokenization swin transformer (SPT-Swin) for image classification. Shifted patch tokenization (SPT) compensates for data deficiency by increasing the data samples based on the spatial information of the image patches, while the swin transformer achieves linear computational complexity by calculating self-attention within shifted-window-based patches. For model validation, the SPT-Swin framework is trained on the popular benchmark image datasets ImageNet-1K, CIFAR-10, and CIFAR-100, yielding classification accuracies of 89.45%, 95.67%, and 92.95%, respectively. Moreover, a comparative analysis with existing state-of-the-art models shows that classification performance is improved by 7.05%, 4.14%, and 8.30% on ImageNet-1K, CIFAR-10, and CIFAR-100, respectively. Therefore, the proposed SPT-based data augmentation technique combined with the core swin transformer model could serve as a data-efficient, linear-complexity model for future computer vision tasks.
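The core idea of shifted patch tokenization described above can be sketched as follows. This is a minimal illustrative implementation in NumPy, not the authors' code: it assumes diagonal shifts of half the patch size, concatenates the shifted views with the original image along the channel axis, and then splits the result into non-overlapping patch tokens. The exact shift mechanics in the published model may differ (e.g., zero-padded crops rather than the circular `np.roll` used here for brevity).

```python
import numpy as np

def shifted_patch_tokenization(img, patch_size=4, shift=2):
    """Illustrative sketch of shifted patch tokenization (SPT).

    img: (H, W, C) array. The image is diagonally shifted in four
    directions by `shift` pixels (assumed to be half the patch size),
    the shifted copies are concatenated with the original along the
    channel axis, and the result is split into non-overlapping
    patches, each flattened into one token.
    """
    h, w, c = img.shape
    # four diagonal shift directions
    shifts = [(-shift, -shift), (-shift, shift), (shift, -shift), (shift, shift)]
    views = [img]
    for dy, dx in shifts:
        # circular shift used here for simplicity of the sketch
        views.append(np.roll(img, (dy, dx), axis=(0, 1)))
    stacked = np.concatenate(views, axis=-1)  # (H, W, 5C)
    # split into non-overlapping patch_size x patch_size patches
    tokens = stacked.reshape(h // patch_size, patch_size,
                             w // patch_size, patch_size, 5 * c)
    # flatten each patch into one token vector
    tokens = tokens.transpose(0, 2, 1, 3, 4).reshape(
        -1, patch_size * patch_size * 5 * c)
    return tokens
```

Because each token now carries five spatially shifted views of its patch, the tokenizer embeds more spatial context per sample, which is how SPT helps compensate for small training datasets.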

Original language: English
Pages (from-to): 117617-117626
Number of pages: 10
Journal: IEEE Access
Volume: 12
DOIs
Publication status: Published - 2024

Keywords

  • Data efficiency
  • image classification
  • linear complexity
  • shifted patch tokenization
  • swin transformer
