Soundness-Aware Level: A Microscopic Signature that Predicts LLM Reasoning Potential
2510.15216v2
cs.LG, cs.CL
2025-10-22
Авторы:
Xuansheng Wu, Xiaoman Pan, Wenlin Yao, Jianshu Chen
Abstract
Reinforcement learning with verifiable rewards (RLVR) can elicit strong
reasoning in large language models (LLMs), while their performance after RLVR
varies dramatically across different base models. This raises a fundamental
question: what microscopic property of pre-trained models leads to this
variation? To investigate, we formalize reasoning as chains of Horn clauses
("if-then" rules) built from features extracted from the LLM's latent space via
cross-layer sparse autoencoders (SAEs). We estimate the transition
probabilities between its features, and further categorize each rule by its
semantic soundness level (e.g., strict, plausible, noisy) with an LLM. Our key
discovery is that high-potential models are inherently soundness-aware: their
internal probability distributions systematically shift across rules' soundness
levels, becoming highly distinct for "strict" versus "noisy" rules. In
contrast, weaker models are soundness-agnostic, collapsing to one distribution
regardless of soundness levels. To quantify this, we introduce the
Soundness-Aware Level (SAL), a microscopic metric using the Jensen-Shannon
Divergence to measure the separation between these distributions. We show that
SAL's predictions of post-RLVR reasoning performance follow a precise empirical
law (R^2=0.87) across diverse model families (Qwen, Mistral, Llama, DeepSeek)
and scales (0.5B-14B). This reveals that a model's reasoning potential is tied
to its intrinsic, pre-trained ability to distinguish sound knowledge from
unsound ones. These findings underscore the critical role of model pre-training
in shaping reasoning and offer a practical metric grounded in the model's
internal mechanisms for selecting/designing stronger base models.
Ссылки и действия
Дополнительные ресурсы: