ViBED-Net: Video Based Engagement Detection Network Using Face-Aware and Scene-Aware Spatiotemporal Cues
2510.18016v1
cs.CV, cs.LG, I.2.10; I.5.2
2025-10-23
Authors:
Prateek Gothwal, Deeptimaan Banerjee, Ashis Kumer Biswas
Abstract
Engagement detection in online learning environments is vital for improving
student outcomes and personalizing instruction. We present ViBED-Net
(Video-Based Engagement Detection Network), a novel deep learning framework
designed to assess student engagement from video data using a dual-stream
architecture. ViBED-Net captures both facial expressions and full-scene context
by processing facial crops and entire video frames through EfficientNetV2 for
spatial feature extraction. These features are then analyzed over time using
two temporal modeling strategies: Long Short-Term Memory (LSTM) networks and
Transformer encoders. Our model is evaluated on the DAiSEE dataset, a
large-scale benchmark for affective state recognition in e-learning. To enhance
performance on underrepresented engagement classes, we apply targeted data
augmentation techniques. Among the tested variants, ViBED-Net with LSTM
achieves 73.43% accuracy, outperforming existing state-of-the-art approaches.
ViBED-Net demonstrates that combining face-aware and scene-aware spatiotemporal
cues significantly improves engagement detection accuracy. Its modular design
allows flexibility for application across education, user experience research,
and content personalization. This work advances video-based affective computing
by offering a scalable, high-performing solution for real-world engagement
analysis. The source code for this project is available at
https://github.com/prateek-gothwal/ViBED-Net
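
The abstract mentions targeted data augmentation for underrepresented engagement classes but does not say how it is applied. As a minimal sketch of one plausible approach, the snippet below computes how many augmented clips each class would need to reach a chosen fraction of the largest class. The function name `augmentation_plan`, the `target_ratio` parameter, and the label counts are all hypothetical illustrations; they are not taken from the paper, its repository, or DAiSEE statistics.

```python
import math
from collections import Counter


def augmentation_plan(labels, target_ratio=1.0):
    """Return the number of augmented clips to synthesize per class.

    Each class is topped up until it holds at least
    target_ratio * (size of the largest class) clips.
    This is an assumed balancing rule, not the authors' documented method.
    """
    if not 0.0 < target_ratio <= 1.0:
        raise ValueError("target_ratio must be in (0, 1]")
    counts = Counter(labels)
    if not counts:
        return {}
    target = math.ceil(target_ratio * max(counts.values()))
    # Classes already at or above the target need no extra clips.
    return {cls: max(0, target - n) for cls, n in sorted(counts.items())}


# Toy, made-up label distribution over four engagement levels (0 to 3).
labels = [0] * 5 + [1] * 20 + [2] * 100 + [3] * 80
plan = augmentation_plan(labels, target_ratio=0.5)
print(plan)
```

With these toy counts, only the two smallest classes receive augmented clips. Raising `target_ratio` toward 1.0 balances the classes more aggressively, at the cost of training on more synthetic data.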