LLM-based Fusion of Multi-modal Features for Commercial Memorability Prediction
2510.22829v1
cs.CV, cs.AI, cs.MM
2025-10-29
Authors:
Aleksandar Pramov
Abstract
This paper addresses the prediction of commercial (brand) memorability as
part of "Subtask 2: Commercial/Ad Memorability" within the "Memorability:
Predicting movie and commercial memorability" task at the MediaEval 2025
workshop competition. We propose a multimodal fusion system with a Gemma-3 LLM
backbone that integrates pre-computed visual (ViT) and textual (E5) features via
multimodal projections. The model is adapted using Low-Rank Adaptation (LoRA).
A heavily tuned ensemble of gradient-boosted trees serves as a baseline. A key
contribution is the use of LLM-generated rationale prompts, grounded in
expert-derived aspects of memorability, to guide the fusion model. The results
demonstrate that the LLM-based system is more robust and generalizes better on
the final test set than the baseline.
The paper's codebase can be found at
https://github.com/dsgt-arc/mediaeval-2025-memorability
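
The fusion idea in the abstract (pre-computed ViT and E5 features projected into the LLM's embedding space, with the backbone adapted by LoRA) can be illustrated with a minimal, dependency-free sketch. This is not the paper's implementation: the dimensions are toy values, the class and variable names (`LoRALinear`, `proj_visual`, `proj_text`) are hypothetical, and treating each modality as a single "soft token" is an assumption made purely for illustration.

```python
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix, standing in for learned weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, d_in, d_out, r, alpha, rng):
        self.W = rand_matrix(d_out, d_in, rng)        # frozen base weight
        self.A = rand_matrix(r, d_in, rng)            # trainable, random init
        self.B = [[0.0] * r for _ in range(d_out)]    # trainable, zero init
        self.scale = alpha / r

    def __call__(self, x):
        base = matvec(self.W, x)
        update = matvec(self.B, matvec(self.A, x))    # rank-r correction
        return [b + self.scale * u for b, u in zip(base, update)]


rng = random.Random(0)
D_VIT, D_E5, D_LLM = 8, 6, 4  # toy sizes (assumption), not the real model dimensions

# One projection per modality maps pre-computed features into LLM embedding space.
proj_visual = rand_matrix(D_LLM, D_VIT, rng)
proj_text = rand_matrix(D_LLM, D_E5, rng)

vit_feature = [rng.uniform(-1, 1) for _ in range(D_VIT)]  # stand-in for a ViT embedding
e5_feature = [rng.uniform(-1, 1) for _ in range(D_E5)]    # stand-in for an E5 embedding

# Assumption: each modality becomes one "soft token" fed to the backbone.
soft_tokens = [matvec(proj_visual, vit_feature), matvec(proj_text, e5_feature)]

# A single LoRA-adapted layer stands in for the adapted LLM backbone.
layer = LoRALinear(D_LLM, D_LLM, r=2, alpha=4, rng=rng)
outputs = [layer(tok) for tok in soft_tokens]
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly; training would then update only `A`, `B`, and the projection matrices, which is what keeps LoRA adaptation cheap relative to full fine-tuning.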