How to Teach Large Multimodal Models New Skills

2510.08564v1 cs.AI, cs.CV, cs.LG 2025-10-11

Authors:

Zhen Zhu, Yiming Gong, Yao Xiao, Yaoyao Liu, Derek Hoiem

Abstract

How can we teach large multimodal models (LMMs) new skills without erasing prior abilities? We study sequential fine-tuning on five target skills while monitoring general ability on eight held-out benchmarks across three model families. We observe that apparent "forgetting" on held-out tasks after narrow fine-tuning can partly recover at later stages. We trace this behavior to a measurable shift in the output token distribution, manifested through a simple counting-bias probe that co-varies with forgetting. Guided by this picture, we identify two simple, robust tuning recipes that learn strongly while limiting drift: (i) updating only the self-attention projection layers, and (ii) updating only the MLP Gate&Up while freezing the Down projection. Across models and tasks, these choices deliver strong target gains while largely preserving held-out performance. Code is available at https://github.com/jessemelpolio/LMM_CL
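As a rough illustration of the two recipes named in the abstract, the sketch below selects which parameters would be updated under each one by matching on parameter names. The name patterns (`self_attn.q_proj`, `mlp.gate_proj`, and so on) are an assumption based on LLaMA/Qwen-style module naming, and the helper names `RECIPES`, `is_trainable`, and `split_params` are hypothetical; the authors' repository may select layers differently, and this sketch does not distinguish the language model from a vision encoder.

```python
# Hypothetical name patterns, assuming LLaMA/Qwen-style module names.
# The paper's own code may select parameters differently.
RECIPES = {
    # Recipe (i): update only the self-attention projection layers.
    "attn_proj": (
        "self_attn.q_proj",
        "self_attn.k_proj",
        "self_attn.v_proj",
        "self_attn.o_proj",
    ),
    # Recipe (ii): update only the MLP Gate and Up projections;
    # the Down projection stays frozen because it is not listed.
    "mlp_gate_up": ("mlp.gate_proj", "mlp.up_proj"),
}


def is_trainable(param_name: str, recipe: str) -> bool:
    """Return True if the named parameter should receive updates."""
    return any(pattern in param_name for pattern in RECIPES[recipe])


def split_params(param_names, recipe):
    """Partition parameter names into (trainable, frozen) lists."""
    trainable = [n for n in param_names if is_trainable(n, recipe)]
    frozen = [n for n in param_names if not is_trainable(n, recipe)]
    return trainable, frozen


# Illustrative parameter names for a single transformer block.
example_names = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.self_attn.o_proj.weight",
    "model.layers.0.mlp.gate_proj.weight",
    "model.layers.0.mlp.up_proj.weight",
    "model.layers.0.mlp.down_proj.weight",
    "model.embed_tokens.weight",
]

trainable, frozen = split_params(example_names, "mlp_gate_up")
print("trainable:", trainable)
print("frozen:", frozen)
```

With a PyTorch model, the same predicate could be applied by setting `param.requires_grad = is_trainable(name, recipe)` for each `(name, param)` pair from `model.named_parameters()`, provided the model's parameter names follow the assumed pattern.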

