From Literal to Liberal: A Meta-Prompting Framework for Eliciting Human-Aligned Exception Handling in Large Language Models
2510.12864v1
cs.AI, cs.CL, cs.LG
2025-10-17
Авторы:
Imran Khan
Abstract
Large Language Models (LLMs) are increasingly being deployed as the reasoning
engines for agentic AI systems, yet they exhibit a critical flaw: a rigid
adherence to explicit rules that leads to decisions misaligned with human
common sense and intent. This "rule-rigidity" is a significant barrier to
building trustworthy autonomous agents. While prior work has shown that
supervised fine-tuning (SFT) with human explanations can mitigate this issue,
SFT is computationally expensive and inaccessible to many practitioners. To
address this gap, we introduce the Rule-Intent Distinction (RID) Framework, a
novel, low-compute meta-prompting technique designed to elicit human-aligned
exception handling in LLMs in a zero-shot manner. The RID framework provides
the model with a structured cognitive schema for deconstructing tasks,
classifying rules, weighing conflicting outcomes, and justifying its final
decision. We evaluated the RID framework against baseline and Chain-of-Thought
(CoT) prompting on a custom benchmark of 20 scenarios requiring nuanced
judgment across diverse domains. Our human-verified results demonstrate that
the RID framework significantly improves performance, achieving a 95% Human
Alignment Score (HAS), compared to 80% for the baseline and 75% for CoT.
Furthermore, it consistently produces higher-quality, intent-driven reasoning.
This work presents a practical, accessible, and effective method for steering
LLMs from literal instruction-following to liberal, goal-oriented reasoning,
paving the way for more reliable and pragmatic AI agents.
Ссылки и действия
Дополнительные ресурсы: