SECA: Semantically Equivalent and Coherent Attacks for Eliciting LLM Hallucinations
2510.04398v1
cs.CL, cs.AI, cs.CR, cs.LG
2025-10-08
Авторы:
Buyun Liang, Liangzu Peng, Jinqi Luo, Darshan Thaker, Kwan Ho Ryan Chan, René Vidal
Abstract
Large Language Models (LLMs) are increasingly deployed in high-risk domains.
However, state-of-the-art LLMs often produce hallucinations, raising serious
concerns about their reliability. Prior work has explored adversarial attacks
for hallucination elicitation in LLMs, but it often produces unrealistic
prompts, either by inserting gibberish tokens or by altering the original
meaning. As a result, these approaches offer limited insight into how
hallucinations may occur in practice. While adversarial attacks in computer
vision often involve realistic modifications to input images, the problem of
finding realistic adversarial prompts for eliciting LLM hallucinations has
remained largely underexplored. To address this gap, we propose Semantically
Equivalent and Coherent Attacks (SECA) to elicit hallucinations via realistic
modifications to the prompt that preserve its meaning while maintaining
semantic coherence. Our contributions are threefold: (i) we formulate finding
realistic attacks for hallucination elicitation as a constrained optimization
problem over the input prompt space under semantic equivalence and coherence
constraints; (ii) we introduce a constraint-preserving zeroth-order method to
effectively search for adversarial yet feasible prompts; and (iii) we
demonstrate through experiments on open-ended multiple-choice question
answering tasks that SECA achieves higher attack success rates while incurring
almost no constraint violations compared to existing methods. SECA highlights
the sensitivity of both open-source and commercial gradient-inaccessible LLMs
to realistic and plausible prompt variations. Code is available at
https://github.com/Buyun-Liang/SECA.