Beyond Monolithic Rewards: A Hybrid and Multi-Aspect Reward Optimization for MLLM Alignment
2510.05283v1
cs.AI, cs.CL, cs.CV
2025-10-09
Authors:
Radha Gulhane, Sathish Reddy Indurthi
Abstract
Aligning multimodal large language models (MLLMs) with human preferences
often relies on single-signal, model-based reward methods. Such monolithic
rewards tend to lack confidence calibration across domain-specific tasks, fail to
capture diverse aspects of human preferences, and require extensive data
annotation and reward model training. In this work, we propose a hybrid reward
modeling framework that integrates complementary reward paradigms: (i)
model-based rewards, where a learned reward model predicts scalar or vector
scores from synthetic and human feedback, and (ii) rule-based rewards, where
domain-specific heuristics provide explicit correctness signals with
confidence. Beyond accuracy, we further incorporate multi-aspect rewards to
enforce instruction adherence and introduce a generalized length-penalty reward
to stabilize training and improve performance. The proposed framework provides
a flexible and effective approach to aligning MLLMs through reinforcement
learning policy optimization. Our experiments show consistent improvements
across different multimodal benchmarks when applying hybrid and multi-aspect
reward modeling. Our best-performing model in the 3B family achieves an overall
average improvement of ~9.5% across general and math reasoning tasks. Focusing
specifically on mathematical benchmarks, the model achieves a significant
average improvement of ~16%, highlighting its effectiveness in mathematical
reasoning and problem solving.
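As a rough illustration of the idea described in the abstract, the following minimal sketch combines a rule-based correctness check, a model-based fallback score, an instruction-adherence term, and a length penalty into one scalar reward. It is not the authors' implementation: the function names, the `Answer:` marker convention, the weights, the fallback rule, and the linear form of the length penalty are all assumptions chosen only to make the example concrete.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights; the abstract does not specify any values.
    accuracy: float = 1.0
    instruction: float = 0.5
    length: float = 0.2


def rule_based_accuracy(response: str, reference: str) -> Optional[float]:
    """Return 1.0 or 0.0 when a final answer can be extracted, else None.

    Assumes (hypothetically) that answers follow an 'Answer:' marker.
    """
    marker = "Answer:"
    if marker not in response:
        return None  # the rule does not apply to this response
    predicted = response.rsplit(marker, 1)[1].strip()
    return 1.0 if predicted == reference.strip() else 0.0


def length_penalty(num_tokens: int, target: int, tolerance: int) -> float:
    """Zero within `tolerance` of `target`, then linear down to -1 (assumed form)."""
    excess = max(0, abs(num_tokens - target) - tolerance)
    return -min(1.0, excess / max(1, target))


def hybrid_reward(
    response: str,
    reference: str,
    model_score: Callable[[str], float],
    follows_format: Callable[[str], bool],
    weights: RewardWeights = RewardWeights(),
    target_len: int = 50,
    tolerance: int = 25,
) -> float:
    """Weighted sum of accuracy, instruction-adherence, and length terms."""
    rule = rule_based_accuracy(response, reference)
    # Prefer the explicit rule signal; fall back to the learned reward model.
    accuracy = rule if rule is not None else model_score(response)
    instruction = 1.0 if follows_format(response) else 0.0
    # Whitespace splitting stands in for a real tokenizer.
    penalty = length_penalty(len(response.split()), target_len, tolerance)
    return (
        weights.accuracy * accuracy
        + weights.instruction * instruction
        + weights.length * penalty
    )
```

In this sketch the rule-based signal takes priority whenever it applies, and the learned score is used only for responses the rule cannot judge. That is one possible way to combine the two paradigms, chosen here for simplicity.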