An empirical framework for detecting speaking modes using ensemble classifier

Sadia Afroze, Md Rajib Hossain, Mohammed Moshiul Hoque, M. Ali Akber Dewan

Research output: Contribution to journalJournal Articlepeer-review

2 Citations (Scopus)


Detecting the speaking modes of human is an important cue in many applications, including detecting active/inactive participants in video conferencing, monitoring students’ attention in classrooms or online, analyzing students’ engagement in live video lectures, and identifying drivers’ distractions. However, automatically detecting speaking mode from a video is challenging due to the low resolution of the images, noise, illumination change, and unfavorable viewing conditions. This paper proposes a deep learning-based ensemble technique (called V-ensemble) to identify speaking modes, i.e., talking and non-talking, considering low-resolution and noisy images. This work also introduces an automatic algorithm for the video stream-to-image frame acquisition and develops three datasets for this research (LLLR, YawDD-M, and SBD-M). The proposed system integrated mouth region extraction and mouth state detection modules. A multi-task cascaded neural network (MTCNN) is used to extract the mouth region. Eight popular deep learning approaches, such as ResNet18, ResNet35, ResNet50, VGG16, VGG19, CNN, InceptionV3 and SVM have been investigated to select the best models for the mouth state prediction. Experimental results with a rigorous comparative analysis showed that the proposed ensemble classifier achieved the highest accuracy on three datasets: LLLR (96.80%), YawDD-M (96.69%) and SBD-M (96.90%).

Original languageEnglish
Pages (from-to)2349-2382
Number of pages34
JournalMultimedia Tools and Applications
Issue number1
Publication statusPublished - Jan. 2024


  • Computer vision
  • Ensemble-based classification
  • Human computer interaction
  • Lip motion detection
  • Speaking mode detection


Dive into the research topics of 'An empirical framework for detecting speaking modes using ensemble classifier'. Together they form a unique fingerprint.

Cite this