SAVANT: Semantic Analysis with Vision-Augmented Anomaly deTection
2510.18034v1
cs.CV, cs.AI, cs.RO, I.2.9; I.4.8
2025-10-23
Authors:
Roberto Brusnicki, David Pop, Yuan Gao, Mattia Piccinini, Johannes Betz
Abstract
Autonomous driving systems remain critically vulnerable to the long tail of
rare, out-of-distribution scenarios with semantic anomalies. While Vision
Language Models (VLMs) offer promising reasoning capabilities, naive prompting
approaches yield unreliable performance and depend on expensive proprietary
models, limiting practical deployment. We introduce SAVANT (Semantic Analysis
with Vision-Augmented Anomaly deTection), a structured reasoning framework that
achieves high accuracy and recall in detecting anomalous driving scenarios from
input images through layered scene analysis and a two-phase pipeline:
structured scene description extraction followed by multi-modal evaluation. Our
approach transforms VLM reasoning from ad-hoc prompting to systematic analysis
across four semantic layers: Street, Infrastructure, Movable Objects, and
Environment. SAVANT achieves 89.6% recall and 88.0% accuracy on real-world
driving scenarios, significantly outperforming unstructured baselines. More
importantly, we demonstrate that our structured framework enables a fine-tuned
7B-parameter open-source model (Qwen2.5VL) to achieve 90.8% recall and 93.8%
accuracy, surpassing all models evaluated while enabling local deployment at
near-zero cost. By automatically labeling over 9,640 real-world images with
high accuracy, SAVANT addresses the critical data scarcity problem in anomaly
detection and provides a practical path toward reliable, accessible semantic
monitoring for autonomous systems.
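
The two-phase pipeline described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `VLM` callable, the prompt wording, the `ANOMALY`/`NORMAL` answer format, and all function names are assumptions made for illustration. Only the four layer names and the two phases (structured description, then multi-modal evaluation) come from the abstract. A small helper shows how recall and accuracy are conventionally computed for a binary anomaly label.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple

# The four semantic layers named in the abstract.
LAYERS = ("Street", "Infrastructure", "Movable Objects", "Environment")

# Hypothetical VLM interface (an assumption): (image_bytes, prompt) -> text.
VLM = Callable[[bytes, str], str]


@dataclass
class Verdict:
    anomalous: bool
    descriptions: Dict[str, str]


def describe_scene(image: bytes, vlm: VLM) -> Dict[str, str]:
    """Phase 1: request one structured description per semantic layer."""
    return {
        layer: vlm(image, f"Describe the {layer} layer of this driving scene.")
        for layer in LAYERS
    }


def evaluate_scene(image: bytes, descriptions: Dict[str, str], vlm: VLM) -> Verdict:
    """Phase 2: judge the scene from the image plus the layer descriptions."""
    summary = "\n".join(f"{layer}: {text}" for layer, text in descriptions.items())
    # Illustrative prompt and answer format, not taken from the paper.
    prompt = (
        "Given the image and these layer descriptions, "
        "answer ANOMALY or NORMAL.\n" + summary
    )
    answer = vlm(image, prompt).strip().upper()
    return Verdict(anomalous=answer.startswith("ANOMALY"), descriptions=descriptions)


def detect(image: bytes, vlm: VLM) -> Verdict:
    """Run both phases in sequence on a single image."""
    return evaluate_scene(image, describe_scene(image, vlm), vlm)


def recall_and_accuracy(
    predicted: Sequence[bool], actual: Sequence[bool]
) -> Tuple[float, float]:
    """Recall = TP / (TP + FN); accuracy = correct / total."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fn = sum(1 for p, a in zip(predicted, actual) if a and not p)
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = correct / len(actual) if actual else 0.0
    return recall, accuracy
```

Passing the model as a plain callable keeps the sketch independent of any particular model API, so a proprietary model and a locally deployed open-source model could be swapped without changing the pipeline code.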