Finding the Sweet Spot: Trading Quality, Cost, and Speed During Inference-Time LLM Reflection
2510.20653v1
stat.ML, cs.AI, cs.LG
2025-10-25
Authors:
Jack Butler, Nikita Kozodoi, Zainab Afolabi, Brian Tyacke, Gaiar Baimuratov
Abstract
As Large Language Models (LLMs) continue to evolve, practitioners face
increasing options for enhancing inference-time performance without model
retraining, including budget tuning and multi-step techniques like
self-reflection. While these methods improve output quality, they create
complex trade-offs among accuracy, cost, and latency that remain poorly
understood across different domains. This paper systematically compares
self-reflection and budget tuning across mathematical reasoning and translation
tasks. We evaluate prominent LLMs, including the Anthropic Claude, Amazon Nova,
and Mistral families, along with other models, under varying reflection depths and
compute budgets to derive Pareto-optimal performance frontiers. Our analysis
reveals substantial domain-dependent variation in self-reflection
effectiveness, with performance gains of up to 220% in mathematical reasoning. We
further investigate how reflection round depth and feedback mechanism quality
influence performance across model families. To validate our findings in a
real-world setting, we deploy a self-reflection-enhanced marketing content
localisation system at Lounge by Zalando, where it shows market-dependent
effectiveness, reinforcing the importance of domain-specific evaluation when
deploying these techniques. Our results provide actionable guidance for
selecting optimal inference strategies given specific domains and resource
constraints. We open-source our self-reflection implementation for
reproducibility at
https://github.com/aws-samples/sample-genai-reflection-for-bedrock.