Prompting Test-Time Scaling Is A Strong LLM Reasoning Data Augmentation
2510.09599v1
cs.CL, cs.AI, cs.LG
2025-10-14
Авторы:
Sondos Mahmoud Bsharat, Zhiqiang Shen
Abstract
Large language models (LLMs) have demonstrated impressive reasoning
capabilities when provided with chain-of-thought exemplars, but curating large
reasoning datasets remains laborious and resource-intensive. In this work, we
introduce Prompting Test-Time Scaling (P-TTS), a simple yet effective
inference-time data augmentation strategy for enhancing LLM reasoning through
finetuning. Rather than collecting thousands or even millions of examples,
P-TTS leverages a small pool of only 90 manually selected reasoning instances
and systematically varies exemplar augmentation through principled instruction
prompting intensities at test time to synthesize diverse reasoning trajectory
contexts. Then we finetune the various sizes of Qwen-2.5 models on P-TTS data.
Across a suite of mathematical reasoning AIME2024 & 25, MATH500, and
GPQA-Diamond, our P-TTS-7B and 32B models outperform the prior competitive
baselines like S1 and S1.1 (1K-shot), achieving absolute accuracy gains of
+26.66% and +30.00% on AIME'24 (7B), and +13.34% and +6.67% on AIME'25 (7B);
P-TTS-32B yields gains of +23.33% and +16.63% on AIME'24, and +26.63% and
+3.33% on AIME'25 (vs. S1 and S1.1, respectively), with comparable or better
performance on MATH500 and GPQA-Diamond. We further show that P-TTS enhances
zero-shot generalization accuracy on out-of-domain reasoning benchmarks of
Gaokao, Kaoyan, OlympiadBench, AMC23, GradeSchoolMath, and Minerva. Our
analysis suggests that test-time scaling effectively explores the latent space
of reasoning patterns, amplifying LLM problem-solving with minimal annotation
overhead, and further unlocking the reasoning potential and capabilities of
LLMs. Prompting Test-Time Scaling offers a practical, low-cost way to elicit
LLM reasoning in resource-constrained or rapidly evolving domains.
Ссылки и действия
Дополнительные ресурсы: