Efficient Test-Time Scaling for Small Vision-Language Models
2510.03574v1
cs.LG, cs.CV
2025-10-08
Авторы:
Mehmet Onurcan Kaya, Desmond Elliott, Dim P. Papadopoulos
Abstract
Small Vision-Language Models (VLMs) provide a computationally efficient
alternative to larger models, at the cost of weaker generalization abilities
and downstream task performance. These shortcomings could be addressed by
test-time scaling techniques, but existing methods are typically
computationally demanding, contradicting the resource-efficient design goals of
small models. To address these limitations, we propose two novel and efficient
test-time scaling strategies that leverage the model-internal features rather
than external supervision: (i) Test-Time Augmentation (TTAug), which generates
multiple augmented inputs and aggregates outputs at the token level without
parameter updates, and (ii) Test-Time Adaptation (TTAdapt), which adapts model
parameters during inference using consensus-based pseudolabels from TTAug.
Through extensive experiments across nine benchmarks, we demonstrate consistent
performance improvements while maintaining computational efficiency suitable
for resource-constrained environments. The generality of our approach is
demonstrated both within models at different scales and across different VLMs
without additional tuning.
Ссылки и действия
Дополнительные ресурсы: