Words That Make Language Models Perceive

2510.02425v1 cs.CL, cs.CV, cs.LG 2025-10-07
Authors:

Sophie L. Wang, Phillip Isola, Brian Cheung

Abstract

Large language models (LLMs) trained purely on text ostensibly lack any direct perceptual experience, yet their internal representations are implicitly shaped by multimodal regularities encoded in language. We test the hypothesis that explicit sensory prompting can surface this latent structure, bringing a text-only LLM into closer representational alignment with specialist vision and audio encoders. When a sensory prompt tells the model to 'see' or 'hear', it cues the model to resolve its next-token predictions as if they were conditioned on latent visual or auditory evidence that is never actually supplied. Our findings reveal that lightweight prompt engineering can reliably activate modality-appropriate representations in purely text-trained LLMs.
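The abstract does not say how "representational alignment" is scored. One common way to compare two embedding spaces over the same paired items is a mutual k-nearest-neighbor overlap. The sketch below assumes that metric and is not confirmed as the authors' procedure. The function names (`cosine`, `knn_sets`, `mutual_knn_alignment`) and the toy vectors are hypothetical, chosen only for illustration. In the setting described above, `emb_a` would hold a text-only LLM's representations of captions under a given prompt, and `emb_b` would hold a vision or audio encoder's representations of the paired images or sounds.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def knn_sets(embeddings, k):
    """For each item, return the set of indices of its k most similar other items."""
    result = []
    for i, u in enumerate(embeddings):
        scored = [(cosine(u, v), j) for j, v in enumerate(embeddings) if j != i]
        # Sort by descending similarity; break ties by index for determinism.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        result.append({j for _, j in scored[:k]})
    return result


def mutual_knn_alignment(emb_a, emb_b, k):
    """Mean fraction of shared k-nearest neighbors across two embedding spaces.

    emb_a[i] and emb_b[i] must describe the same underlying item.
    Returns a value in [0, 1]; higher means the two spaces agree more
    about which items are neighbors.
    """
    if len(emb_a) != len(emb_b):
        raise ValueError("embedding lists must be paired item by item")
    if not 1 <= k < len(emb_a):
        raise ValueError("k must satisfy 1 <= k < number of items")
    nn_a = knn_sets(emb_a, k)
    nn_b = knn_sets(emb_b, k)
    overlaps = [len(a & b) / k for a, b in zip(nn_a, nn_b)]
    return sum(overlaps) / len(overlaps)


# Toy, hand-made vectors (not data from the paper).
encoder_space = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
similar_space = [(2.0, 0.1), (1.8, 0.3), (0.1, 2.0), (0.3, 1.8)]
shuffled_space = [(1.0, 0.0), (0.0, 1.0), (0.9, 0.1), (0.1, 0.9)]

print(mutual_knn_alignment(encoder_space, similar_space, k=1))
print(mutual_knn_alignment(encoder_space, shuffled_space, k=1))
```

Under this assumed metric, testing the paper's hypothesis would amount to computing the score once with a neutral prompt and once with a sensory prompt, then checking whether the second score is higher against the matching encoder.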
