LLM-based Fusion of Multi-modal Features for Commercial Memorability Prediction

2510.22829v1 cs.CV, cs.AI, cs.MM 2025-10-29

Authors:

Aleksandar Pramov

Abstract

This paper addresses the prediction of commercial (brand) memorability as part of "Subtask 2: Commercial/Ad Memorability" within the "Memorability: Predicting movie and commercial memorability" task at the MediaEval 2025 workshop competition. We propose a multimodal fusion system with a Gemma-3 LLM backbone that integrates pre-computed visual (ViT) and textual (E5) features via multimodal projections. The model is adapted using Low-Rank Adaptation (LoRA). A heavily tuned ensemble of gradient-boosted trees serves as a baseline. A key contribution is the use of LLM-generated rationale prompts, grounded in expert-derived aspects of memorability, to guide the fusion model. The results demonstrate that the LLM-based system is more robust and generalizes better on the final test set than the baseline. The paper's codebase can be found at https://github.com/dsgt-arc/mediaeval-2025-memorability
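
The fusion idea described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation (see the linked repository for that). It assumes that each "multimodal projection" is a plain linear map from a pre-computed feature vector into the backbone's hidden size, and that LoRA is applied in its standard form, where a frozen weight W receives a trainable low-rank update (alpha / r) * B A with B initialized to zero. All names (`fuse`, `LoRALinear`, the dimensions) are hypothetical, and the sizes are tiny stand-ins for real ViT, E5, and Gemma-3 dimensions.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def random_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix, used here in place of learned weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

class LoRALinear:
    """Frozen weight W plus a low-rank update (alpha / rank) * B @ A."""

    def __init__(self, in_dim, out_dim, rank, alpha, rng):
        self.weight = random_matrix(out_dim, in_dim, rng)      # frozen backbone weight
        self.lora_a = random_matrix(rank, in_dim, rng)         # trainable, random init
        self.lora_b = [[0.0] * rank for _ in range(out_dim)]   # trainable, zero init
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.lora_b, matvec(self.lora_a, x))
        return [b + self.scale * u for b, u in zip(base, update)]

def fuse(visual_feat, text_feat, visual_proj, text_proj):
    """Project each modality to the shared hidden size (assumed linear maps).

    Returns two "soft tokens" that a backbone could consume alongside
    ordinary text-token embeddings.
    """
    return [matvec(visual_proj, visual_feat), matvec(text_proj, text_feat)]

rng = random.Random(0)
VISUAL_DIM, TEXT_DIM, HIDDEN_DIM = 6, 4, 5  # toy sizes, not the real model dimensions

visual_proj = random_matrix(HIDDEN_DIM, VISUAL_DIM, rng)
text_proj = random_matrix(HIDDEN_DIM, TEXT_DIM, rng)
visual_feat = [rng.random() for _ in range(VISUAL_DIM)]  # stand-in for a ViT feature
text_feat = [rng.random() for _ in range(TEXT_DIM)]      # stand-in for an E5 feature

tokens = fuse(visual_feat, text_feat, visual_proj, text_proj)
layer = LoRALinear(HIDDEN_DIM, HIDDEN_DIM, rank=2, alpha=4, rng=rng)
outputs = [layer.forward(token) for token in tokens]
```

Because `lora_b` starts at zero, the adapted layer initially reproduces the frozen layer exactly; training would then change only `lora_a`, `lora_b`, and the projection matrices, which is what makes this style of adaptation cheap compared with full fine-tuning.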
