TY - JOUR
T1 - Enhancing image caption generation through context-aware attention mechanism
AU - Bhuiyan, Ahatesham
AU - Hossain, Eftekhar
AU - Hoque, Mohammed Moshiul
AU - Ali Akber Dewan, M.
N1 - Publisher Copyright: © 2024 The Author(s)
PY - 2024/9/15
Y1 - 2024/9/15
N2 - Image captioning, the process of generating natural language descriptions of image content, has attracted attention in AI research for its implications in scene understanding and human-computer interaction. While much prior research has focused on caption generation for English, low-resource languages such as Bengali present additional challenges, particularly in producing coherent captions that link visual objects with their corresponding words. This paper proposes a context-aware attention mechanism over semantic attention to accurately identify objects for image captioning in Bengali. The proposed architecture consists of an encoder block and a decoder block. We chose ResNet-50 over other pre-trained models for encoding image features because of its ability to mitigate the vanishing gradient problem and recognize complex object features. For decoding, a bidirectional Gated Recurrent Unit (GRU) architecture combined with an attention mechanism captures contextual dependencies in both directions, resulting in more accurate captions. The paper also highlights the challenge of transferring knowledge between domains, especially with culturally specific images. Evaluation on three Bengali benchmark datasets, namely BAN-Cap, BanglaLekhaImageCaption, and Bornon, demonstrates significant improvements in METEOR score over existing methods of approximately 30%, 18%, and 45%, respectively. The proposed context-aware, attention-based image captioning system significantly outperforms current state-of-the-art models in Bengali caption generation despite limitations in the reference captions of certain datasets.
AB - Image captioning, the process of generating natural language descriptions of image content, has attracted attention in AI research for its implications in scene understanding and human-computer interaction. While much prior research has focused on caption generation for English, low-resource languages such as Bengali present additional challenges, particularly in producing coherent captions that link visual objects with their corresponding words. This paper proposes a context-aware attention mechanism over semantic attention to accurately identify objects for image captioning in Bengali. The proposed architecture consists of an encoder block and a decoder block. We chose ResNet-50 over other pre-trained models for encoding image features because of its ability to mitigate the vanishing gradient problem and recognize complex object features. For decoding, a bidirectional Gated Recurrent Unit (GRU) architecture combined with an attention mechanism captures contextual dependencies in both directions, resulting in more accurate captions. The paper also highlights the challenge of transferring knowledge between domains, especially with culturally specific images. Evaluation on three Bengali benchmark datasets, namely BAN-Cap, BanglaLekhaImageCaption, and Bornon, demonstrates significant improvements in METEOR score over existing methods of approximately 30%, 18%, and 45%, respectively. The proposed context-aware, attention-based image captioning system significantly outperforms current state-of-the-art models in Bengali caption generation despite limitations in the reference captions of certain datasets.
KW - Attention mechanism
KW - Computer vision
KW - Cross-domain transfer
KW - Encoder-decoder
KW - Image captioning
KW - Natural language processing
UR - http://www.scopus.com/inward/record.url?scp=85202213400&partnerID=8YFLogxK
U2 - 10.1016/j.heliyon.2024.e36272
DO - 10.1016/j.heliyon.2024.e36272
M3 - Journal Article
AN - SCOPUS:85202213400
VL - 10
JO - Heliyon
JF - Heliyon
IS - 17
M1 - e36272
ER -