TY - JOUR
T1 - BVQA: Connecting Language and Vision Through Multimodal Attention for Open-Ended Question Answering
AU - Bhuyan, Md Shalha Mucha
AU - Hossain, Eftekhar
AU - Sathi, Khaleda Akhter
AU - Hossain, Md Azad
AU - Dewan, M. Ali Akber
N1 - Publisher Copyright:
© 2013 IEEE.
PY - 2025
Y1 - 2025
AB - Visual Question Answering (VQA) is a challenging Artificial Intelligence (AI) problem that requires an understanding of both natural language and computer vision to answer questions about the visual content of images. Research on VQA has gained immense traction due to its wide range of applications, such as aiding visually impaired individuals, enhancing human-computer interaction, and facilitating content-based image retrieval systems. While there has been extensive research on VQA, most of it has focused on English, often overlooking the complexities of low-resource languages, especially Bengali. To facilitate research in this area, we have developed a large-scale Bengali Visual Question Answering (BVQA) dataset by harnessing the in-context learning abilities of a Large Language Model (LLM). The BVQA dataset comprises around 17,800 diverse open-ended QA pairs generated from the human-annotated captions of ≈3,500 images. Replicating existing VQA systems for a low-resource language poses significant challenges due to the complexity of their architectures and their adaptations to particular languages. To overcome this challenge, we propose the Multimodal CRoss-Attention Network (MCRAN), a novel framework that leverages pretrained transformer architectures to encode visual and textual information. Our method uses a multi-head attention mechanism to generate three distinct vision-language representations and fuses them with a gating mechanism to answer queries about an image. Extensive experiments on the BVQA dataset show that the proposed method outperforms the existing baseline across various answer categories. The benchmark and source code are available at https://github.com/eftekhar-hossain/Bengali-VQA.
KW - Visual question answering
KW - large language model
KW - multi-head attention mechanism
KW - multimodal deep learning
KW - natural language processing
UR - https://www.scopus.com/pages/publications/85217944029
U2 - 10.1109/ACCESS.2025.3540388
DO - 10.1109/ACCESS.2025.3540388
M3 - Journal Article
AN - SCOPUS:85217944029
VL - 13
SP - 27570
EP - 27586
JO - IEEE Access
JF - IEEE Access
ER -