Xiaoice: Training-Free Video Understanding via Self-Supervised Spatio-Temporal Clustering of Semantic Features
2510.16781v1
cs.CV, cs.AI, cs.CL
2025-10-22
Авторы:
Shihao Ji, Zihui Song
Abstract
The remarkable zero-shot reasoning capabilities of large-scale Visual
Language Models (VLMs) on static images have yet to be fully translated to the
video domain. Conventional video understanding models often rely on extensive,
task-specific training on annotated datasets, a process that is both costly and
limited in scalability. This paper introduces a novel, training-free framework
for video understanding that circumvents end-to-end training by synergistically
combining the rich semantic priors of pre-trained VLMs with classic machine
learning algorithms for pattern discovery. Our core idea is to reframe video
understanding as a self-supervised spatio-temporal clustering problem within a
high-dimensional semantic feature space. The proposed pipeline first transforms
a video stream into a semantic feature trajectory using the frozen visual
encoder of a pre-trained VLM. Subsequently, we employ Kernel Temporal
Segmentation (KTS), a robust machine learning technique, to partition the
continuous feature stream into discrete, semantically coherent event segments.
These segments are then subjected to unsupervised density-based clustering to
identify recurring macroscopic scenes and themes throughout the video. By
selecting representative keyframes from each discovered cluster and leveraging
the VLM's generative capabilities for textual description, our framework
automatically produces a structured, multi-modal summary of the video content.
This approach provides an effective, interpretable, and model-agnostic pathway
for zero-shot, automated structural analysis of video content.
Ссылки и действия
Дополнительные ресурсы: