Words That Make Language Models Perceive
2510.02425v1
cs.CL, cs.CV, cs.LG
2025-10-07
Authors:
Sophie L. Wang, Phillip Isola, Brian Cheung
Abstract
Large language models (LLMs) trained purely on text ostensibly lack any
direct perceptual experience, yet their internal representations are implicitly
shaped by multimodal regularities encoded in language. We test the hypothesis
that explicit sensory prompting can surface this latent structure, bringing a
text-only LLM into closer representational alignment with specialist vision and
audio encoders. When a sensory prompt tells the model to 'see' or 'hear', it
cues the model to resolve its next-token predictions as if they were
conditioned on latent visual or auditory evidence that is never actually
supplied. Our findings reveal that lightweight prompt engineering can reliably
activate modality-appropriate representations in purely text-trained LLMs.
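
To make the idea of "representational alignment" concrete, here is a minimal sketch of one way it could be scored, assuming a mutual k-nearest-neighbour overlap between two embedding sets over the same items. The abstract does not name the metric or the prompt wording, so the functions `mutual_knn_alignment` and `with_sensory_cue`, the cue template, and the toy vectors are all hypothetical illustrations, not the authors' method.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def knn_indices(embeddings, k):
    """For each item, return the set of indices of its k nearest neighbours."""
    neighbours = []
    for i, query in enumerate(embeddings):
        scored = [
            (cosine_similarity(query, other), j)
            for j, other in enumerate(embeddings)
            if j != i
        ]
        # Highest similarity first; ties broken by index for determinism.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        neighbours.append({j for _, j in scored[:k]})
    return neighbours


def mutual_knn_alignment(emb_a, emb_b, k):
    """Average fraction of shared k-nearest neighbours across two spaces.

    emb_a[i] and emb_b[i] must describe the same item, e.g. an LLM embedding
    of a caption and a vision-encoder embedding of the matching image.
    """
    if len(emb_a) != len(emb_b):
        raise ValueError("both embedding lists must cover the same items")
    if not 0 < k < len(emb_a):
        raise ValueError("k must be between 1 and the number of items minus 1")
    nn_a = knn_indices(emb_a, k)
    nn_b = knn_indices(emb_b, k)
    overlaps = [len(a & b) / k for a, b in zip(nn_a, nn_b)]
    return sum(overlaps) / len(overlaps)


def with_sensory_cue(caption, modality):
    """Prefix a caption with a hypothetical 'see' or 'hear' cue."""
    verbs = {"visual": "see", "auditory": "hear"}
    return f"Imagine what it would be like to {verbs[modality]}: {caption}"


# Toy 2-D embeddings (made up): items 0 and 1 are close, items 2 and 3 are close.
llm_space = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
same_structure = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
other_structure = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]

print(mutual_knn_alignment(llm_space, same_structure, k=1))   # 1.0
print(mutual_knn_alignment(llm_space, other_structure, k=1))  # 0.0
print(with_sensory_cue("a red barn in a snowy field", "visual"))
```

In a real experiment the toy lists would be replaced by LLM embeddings of captions, with and without the cue, and by encoder embeddings of the paired images or audio clips; a higher score under the cue would indicate closer alignment in this particular sense.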