TY - JOUR
T1 - An empirical framework for detecting speaking modes using ensemble classifier
AU - Afroze, Sadia
AU - Hossain, Md Rajib
AU - Hoque, Mohammed Moshiul
AU - Dewan, M. Ali Akber
N1 - Publisher Copyright:
© 2023, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
PY - 2024/1
Y1 - 2024/1
N2 - Detecting the speaking modes of humans is an important cue in many applications, including detecting active/inactive participants in video conferencing, monitoring students’ attention in classrooms or online, analyzing students’ engagement in live video lectures, and identifying drivers’ distractions. However, automatically detecting the speaking mode from a video is challenging due to low image resolution, noise, illumination changes, and unfavorable viewing conditions. This paper proposes a deep learning-based ensemble technique (called V-ensemble) to identify speaking modes, i.e., talking and non-talking, considering low-resolution and noisy images. This work also introduces an automatic algorithm for video stream-to-image frame acquisition and develops three datasets for this research (LLLR, YawDD-M, and SBD-M). The proposed system integrates mouth region extraction and mouth state detection modules. A multi-task cascaded neural network (MTCNN) is used to extract the mouth region. Eight popular deep learning approaches (ResNet18, ResNet35, ResNet50, VGG16, VGG19, CNN, InceptionV3, and SVM) have been investigated to select the best models for mouth state prediction. Experimental results with a rigorous comparative analysis showed that the proposed ensemble classifier achieved the highest accuracy on three datasets: LLLR (96.80%), YawDD-M (96.69%), and SBD-M (96.90%).
AB - Detecting the speaking modes of humans is an important cue in many applications, including detecting active/inactive participants in video conferencing, monitoring students’ attention in classrooms or online, analyzing students’ engagement in live video lectures, and identifying drivers’ distractions. However, automatically detecting the speaking mode from a video is challenging due to low image resolution, noise, illumination changes, and unfavorable viewing conditions. This paper proposes a deep learning-based ensemble technique (called V-ensemble) to identify speaking modes, i.e., talking and non-talking, considering low-resolution and noisy images. This work also introduces an automatic algorithm for video stream-to-image frame acquisition and develops three datasets for this research (LLLR, YawDD-M, and SBD-M). The proposed system integrates mouth region extraction and mouth state detection modules. A multi-task cascaded neural network (MTCNN) is used to extract the mouth region. Eight popular deep learning approaches (ResNet18, ResNet35, ResNet50, VGG16, VGG19, CNN, InceptionV3, and SVM) have been investigated to select the best models for mouth state prediction. Experimental results with a rigorous comparative analysis showed that the proposed ensemble classifier achieved the highest accuracy on three datasets: LLLR (96.80%), YawDD-M (96.69%), and SBD-M (96.90%).
KW - Computer vision
KW - Ensemble-based classification
KW - Human computer interaction
KW - Lip motion detection
KW - Speaking mode detection
UR - http://www.scopus.com/inward/record.url?scp=85159352625&partnerID=8YFLogxK
U2 - 10.1007/s11042-023-15254-8
DO - 10.1007/s11042-023-15254-8
M3 - Journal Article
AN - SCOPUS:85159352625
SN - 1380-7501
VL - 83
SP - 2349
EP - 2382
JO - Multimedia Tools and Applications
JF - Multimedia Tools and Applications
IS - 1
ER -