TY - JOUR
T1 - CochleaSpecNet
T2 - An Attention-Based Dual Branch Hybrid CNN-GRU Network for Speech Emotion Recognition Using Cochleagram and Spectrogram
AU - Anika Namey, Atkia
AU - Akter, Khadija
AU - Hossain, Md Azad
AU - Ali Akber Dewan, M.
N1 - Publisher Copyright:
© 2013 IEEE.
PY - 2024
Y1 - 2024
N2 - As one of the main communication media, speech carries essential information about a person's emotional state. Accurate emotion recognition is crucial for enhancing human-machine interaction, which highlights the importance of a strong Speech Emotion Recognition (SER) system. An SER system classifies a speaker's emotional state from their utterances into categories such as sad, happy, neutral, angry, surprise, and calm. This research introduces a novel SER approach that uses cochleagram and spectrogram features to capture relevant speech patterns for the classifier network. The network is a hybrid model that combines Convolutional Neural Networks (CNN) for feature extraction with Gated Recurrent Units (GRU) to handle temporal dependencies. To further improve performance, a multi-head attention mechanism is incorporated after the GRU layer. Despite increasing interest in SER, there is a notable lack of studies using Bangla-language datasets, a significant gap in current research. To address this gap, the model was evaluated on the augmented BanglaSER (Bangla Speech Emotion Recognition) dataset, where it achieved an accuracy of 92.04% in categorizing five distinct emotions: angry, surprise, happy, neutral, and sad. To further evaluate the SER model, the English-language RAVDESS (Ryerson Audio-Visual Database of Emotional Speech) dataset was also used. On this dataset, the model achieved 82.40% accuracy in classifying eight emotions, which include fear, disgust, and calm along with the five emotions of BanglaSER. Moreover, a comparative analysis of the proposed model with existing SER approaches is carried out to demonstrate its stability and robustness. The use of two distinct features as inputs to the attention-guided hybrid neural network demonstrates the efficacy of the proposed SER system, offering a promising approach for precise and efficient emotion categorization from speech signals.
AB - As one of the main communication media, speech carries essential information about a person's emotional state. Accurate emotion recognition is crucial for enhancing human-machine interaction, which highlights the importance of a strong Speech Emotion Recognition (SER) system. An SER system classifies a speaker's emotional state from their utterances into categories such as sad, happy, neutral, angry, surprise, and calm. This research introduces a novel SER approach that uses cochleagram and spectrogram features to capture relevant speech patterns for the classifier network. The network is a hybrid model that combines Convolutional Neural Networks (CNN) for feature extraction with Gated Recurrent Units (GRU) to handle temporal dependencies. To further improve performance, a multi-head attention mechanism is incorporated after the GRU layer. Despite increasing interest in SER, there is a notable lack of studies using Bangla-language datasets, a significant gap in current research. To address this gap, the model was evaluated on the augmented BanglaSER (Bangla Speech Emotion Recognition) dataset, where it achieved an accuracy of 92.04% in categorizing five distinct emotions: angry, surprise, happy, neutral, and sad. To further evaluate the SER model, the English-language RAVDESS (Ryerson Audio-Visual Database of Emotional Speech) dataset was also used. On this dataset, the model achieved 82.40% accuracy in classifying eight emotions, which include fear, disgust, and calm along with the five emotions of BanglaSER. Moreover, a comparative analysis of the proposed model with existing SER approaches is carried out to demonstrate its stability and robustness. The use of two distinct features as inputs to the attention-guided hybrid neural network demonstrates the efficacy of the proposed SER system, offering a promising approach for precise and efficient emotion categorization from speech signals.
KW - cochleagram
KW - hybrid network
KW - multi-head attention
KW - spectrogram
KW - speech emotion
UR - http://www.scopus.com/inward/record.url?scp=85212628604&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2024.3517733
DO - 10.1109/ACCESS.2024.3517733
M3 - Journal Article
AN - SCOPUS:85212628604
VL - 12
SP - 190760
EP - 190774
JO - IEEE Access
JF - IEEE Access
ER -