TY - JOUR
T1 - AVaTER: Fusing Audio, Visual, and Textual Modalities Using Cross-Modal Attention for Emotion Recognition
T2 - Sensors
AU - Das, Avishek
AU - Sarma, Moumita Sen
AU - Hoque, Mohammed Moshiul
AU - Siddique, Nazmul
AU - Dewan, M. Ali Akber
N1 - Publisher Copyright: © 2024 by the authors.
PY - 2024/9
Y1 - 2024/9
AB - Multimodal emotion classification (MEC) involves analyzing and identifying human emotions by integrating data from multiple sources, such as audio, video, and text. This approach leverages the complementary strengths of each modality to enhance the accuracy and robustness of emotion recognition systems. However, one significant challenge is effectively integrating these diverse data sources, each with unique characteristics and levels of noise. Additionally, the scarcity of large, annotated multimodal datasets in Bangla limits the training and evaluation of models. In this work, we unveiled a pioneering multimodal Bangla dataset, MAViT-Bangla (Multimodal Audio Video Text Bangla dataset). This dataset, comprising 1002 samples across audio, video, and text modalities, is a unique resource for emotion recognition studies in the Bangla language. It features emotional categories such as anger, fear, joy, and sadness, providing a comprehensive platform for research. Additionally, we developed a framework for audio, video, and textual emotion recognition (AVaTER) that employs a cross-modal attention mechanism among unimodal features. This mechanism fosters the interaction and fusion of features from different modalities, enhancing the model's ability to capture nuanced emotional cues. The effectiveness of this approach was demonstrated by achieving an F1-score of 0.64, a significant improvement over unimodal methods.
KW - cross-modal attention
KW - multimodal dataset
KW - multimodal emotion recognition
KW - natural language processing
KW - transformers
UR - http://www.scopus.com/inward/record.url?scp=85205226546&partnerID=8YFLogxK
U2 - 10.3390/s24185862
DO - 10.3390/s24185862
M3 - Journal Article
C2 - 39338607
AN - SCOPUS:85205226546
VL - 24
JO - Sensors
JF - Sensors
IS - 18
M1 - 5862
ER -