VC4VG: Optimizing Video Captions for Text-to-Video Generation
2510.24134v2
cs.CV, cs.AI, cs.CL
2025-10-31
Авторы:
Yang Du, Zhuoran Lin, Kaiqiang Song, Biao Wang, Zhicheng Zheng, Tiezheng Ge, Bo Zheng, Qin Jin
Abstract
Recent advances in text-to-video (T2V) generation highlight the critical role
of high-quality video-text pairs in training models capable of producing
coherent and instruction-aligned videos. However, strategies for optimizing
video captions specifically for T2V training remain underexplored. In this
paper, we introduce VC4VG (Video Captioning for Video Generation), a
comprehensive caption optimization framework tailored to the needs of T2V
models. We begin by analyzing caption content from a T2V perspective,
decomposing the essential elements required for video reconstruction into
multiple dimensions, and proposing a principled caption design methodology. To
support evaluation, we construct VC4VG-Bench, a new benchmark featuring
fine-grained, multi-dimensional, and necessity-graded metrics aligned with
T2V-specific requirements. Extensive T2V fine-tuning experiments demonstrate a
strong correlation between improved caption quality and video generation
performance, validating the effectiveness of our approach. We release all
benchmark tools and code at https://github.com/alimama-creative/VC4VG to
support further research.
Ссылки и действия
Дополнительные ресурсы: