Human-Centric Anomaly Detection in Surveillance Videos Using YOLO-World and Spatio-Temporal Deep Learning
2510.22056v1
cs.CV, cs.AI, I.2.10; I.4.9; I.2.6
2025-10-29
Authors:
Mohammad Ali Etemadi Naeen, Hoda Mohammadzade, Saeed Bagheri Shouraki
Abstract
Anomaly detection in surveillance videos remains a challenging task due to
the diversity of abnormal events, class imbalance, and scene-dependent visual
clutter. To address these issues, we propose a robust deep learning framework
that integrates human-centric preprocessing with spatio-temporal modeling for
multi-class anomaly classification. Our pipeline begins by applying YOLO-World,
an open-vocabulary vision-language detector, to identify human instances in
raw video clips, followed by ByteTrack for consistent identity-aware tracking.
Background regions outside detected bounding boxes are suppressed via Gaussian
blurring, effectively reducing scene-specific distractions and focusing the
model on behaviorally relevant foreground content. The refined frames are then
processed by an ImageNet-pretrained InceptionV3 network for spatial feature
extraction, and temporal dynamics are captured using a bidirectional LSTM
(BiLSTM) for sequence-level classification. Evaluated on a five-class subset of
the UCF-Crime dataset (Normal, Burglary, Fighting, Arson, Explosion), our
method achieves a mean test accuracy of 92.41% across three independent trials,
with per-class F1-scores consistently exceeding 0.85. Comprehensive evaluation
metrics (confusion matrices, ROC curves, and macro/weighted averages)
demonstrate strong generalization and resilience to class imbalance. The
results confirm that foreground-focused preprocessing significantly enhances
anomaly discrimination in real-world surveillance scenarios.
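
The background-suppression step described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a grayscale frame stored as a nested list of floats and person boxes given as hypothetical (x1, y1, x2, y2) pixel tuples with exclusive right and bottom edges. The function names, the sigma value, and the hand-written separable blur are illustrative choices; detection and tracking are out of scope here, and a library blur routine would normally replace the pure-Python one.

```python
import math

def gaussian_kernel(sigma, radius):
    """Return a normalized 1-D Gaussian kernel of length 2 * radius + 1."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_rows(image, kernel):
    """Convolve every row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    width = len(image[0])
    blurred = []
    for row in image:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, w in enumerate(kernel):
                xx = min(max(x + k - radius, 0), width - 1)
                acc += w * row[xx]
            new_row.append(acc)
        blurred.append(new_row)
    return blurred

def transpose(image):
    return [list(col) for col in zip(*image)]

def gaussian_blur(image, sigma=2.0):
    """Separable Gaussian blur: a horizontal pass, then a vertical pass."""
    radius = max(1, int(3 * sigma))  # assumed cutoff at about 3 sigma
    kernel = gaussian_kernel(sigma, radius)
    horizontal = blur_rows(image, kernel)
    return transpose(blur_rows(transpose(horizontal), kernel))

def suppress_background(image, boxes, sigma=2.0):
    """Blur the whole frame, then copy the original pixels back inside each box.

    boxes: iterable of (x1, y1, x2, y2); right and bottom edges are exclusive
    (an assumption of this sketch, not something stated in the abstract).
    """
    height, width = len(image), len(image[0])
    output = gaussian_blur(image, sigma)
    for x1, y1, x2, y2 in boxes:
        for y in range(max(0, y1), min(height, y2)):
            for x in range(max(0, x1), min(width, x2)):
                output[y][x] = image[y][x]
    return output

# Usage: a 12x12 checkerboard frame with one hypothetical person box.
frame = [[255.0 if (x + y) % 2 else 0.0 for x in range(12)] for y in range(12)]
focused = suppress_background(frame, boxes=[(4, 4, 8, 8)], sigma=2.0)
```

Pixels inside the box keep their original values while everything outside is smoothed, which is the sense in which the preprocessing directs the downstream network toward foreground content.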