FinCap: Topic-Aligned Captions for Short-Form Financial YouTube Videos
2509.25745v1
cs.CV, cs.CL, cs.MM
2025-10-02
Authors:
Siddhant Sukhani, Yash Bhardwaj, Riya Bhadani, Veer Kejriwal, Michael Galarnyk, Sudheer Chava
Abstract
We evaluate multimodal large language models (MLLMs) for topic-aligned
captioning in financial short-form videos (SVs) by testing joint reasoning over
transcripts (T), audio (A), and video (V). Using 624 annotated YouTube SVs, we
assess all seven modality combinations (T, A, V, TA, TV, AV, TAV) across five
topics: main recommendation, sentiment analysis, video purpose, visual
analysis, and financial entity recognition. Video alone performs strongly on
four of five topics, underscoring its value for capturing visual context and
affective cues such as emotions, gestures, and body language. Selected pairs
such as TV or AV often surpass TAV, implying that too many modalities may
introduce noise. These results establish the first baselines for financial
short-form video captioning and illustrate the potential and challenges of
grounding complex visual cues in this domain. All code and data are available on
our GitHub under the CC-BY-NC-SA 4.0 license.
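
The seven modality combinations named in the abstract are the non-empty subsets of the three inputs (2^3 - 1 = 7). A minimal sketch of that enumeration follows. The function name `modality_combinations` and the constant `MODALITIES` are illustrative assumptions and are not taken from the authors' code.

```python
from itertools import combinations

# Single-letter modality codes as used in the abstract:
# T = transcript, A = audio, V = video.
MODALITIES = ("T", "A", "V")


def modality_combinations(modalities=MODALITIES):
    """Return every non-empty subset of the modalities, smallest subsets first."""
    return [
        "".join(subset)
        for size in range(1, len(modalities) + 1)
        for subset in combinations(modalities, size)
    ]


combos = modality_combinations()
print(combos)       # ['T', 'A', 'V', 'TA', 'TV', 'AV', 'TAV']
print(len(combos))  # 7
```

Each of these seven settings would then be evaluated on each of the five topics, giving 35 setting-topic pairs in total.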