TY - JOUR
T1 - SPT-Swin
T2 - A Shifted Patch Tokenization Swin Transformer for Image Classification
AU - Ferdous, Gazi Jannatul
AU - Sathi, Khaleda Akhter
AU - Hossain, Md Azad
AU - Ali Akber Dewan, M.
N1 - Publisher Copyright:
© 2013 IEEE.
PY - 2024
Y1 - 2024
N2 - Recently, transformer-based models, e.g., the vision transformer (ViT), have been extensively used in computer vision tasks. However, the superior performance of the ViT requires an enormous training dataset, and the complexity of calculating self-attention between patches is quadratic. To address these two concerns, this paper proposes a novel shifted patch tokenization Swin transformer (SPT-Swin) for the image classification task. The shifted patch tokenization (SPT) compensates for data deficiency by increasing the data samples based on the spatial information of the image patches, while the Swin transformer provides linear computational complexity by calculating self-attention within shifted-window-based patches. For model validation, the SPT-Swin framework is trained on popular benchmark image datasets such as ImageNet-1K, CIFAR-10, and CIFAR-100, and the classification accuracies are found to be 89.45%, 95.67%, and 92.95%, respectively. Moreover, a comparative analysis of the proposed model with existing state-of-the-art models shows that the classification performance is improved by 7.05%, 4.14%, and 8.30% for the ImageNet-1K, CIFAR-10, and CIFAR-100 datasets, respectively. Therefore, the proposed SPT-based data augmentation technique with the core Swin transformer model could serve as a data-efficient, linear-complexity model for future computer vision tasks.
AB - Recently, transformer-based models, e.g., the vision transformer (ViT), have been extensively used in computer vision tasks. However, the superior performance of the ViT requires an enormous training dataset, and the complexity of calculating self-attention between patches is quadratic. To address these two concerns, this paper proposes a novel shifted patch tokenization Swin transformer (SPT-Swin) for the image classification task. The shifted patch tokenization (SPT) compensates for data deficiency by increasing the data samples based on the spatial information of the image patches, while the Swin transformer provides linear computational complexity by calculating self-attention within shifted-window-based patches. For model validation, the SPT-Swin framework is trained on popular benchmark image datasets such as ImageNet-1K, CIFAR-10, and CIFAR-100, and the classification accuracies are found to be 89.45%, 95.67%, and 92.95%, respectively. Moreover, a comparative analysis of the proposed model with existing state-of-the-art models shows that the classification performance is improved by 7.05%, 4.14%, and 8.30% for the ImageNet-1K, CIFAR-10, and CIFAR-100 datasets, respectively. Therefore, the proposed SPT-based data augmentation technique with the core Swin transformer model could serve as a data-efficient, linear-complexity model for future computer vision tasks.
KW - Data efficiency
KW - image classification
KW - linear complexity
KW - shifted patch tokenization
KW - Swin transformer
UR - http://www.scopus.com/inward/record.url?scp=85201758325&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2024.3448304
DO - 10.1109/ACCESS.2024.3448304
M3 - Journal Article
AN - SCOPUS:85201758325
VL - 12
SP - 117617
EP - 117626
JO - IEEE Access
JF - IEEE Access
ER -