Making Machines Sound Sarcastic: LLM-Enhanced and Retrieval-Guided Sarcastic Speech Synthesis
2510.07096v1
cs.CL, cs.SD, eess.AS
2025-10-10
Авторы:
Zhu Li, Yuqing Zhang, Xiyuan Gao, Shekhar Nayak, Matt Coler
Abstract
Sarcasm is a subtle form of non-literal language that poses significant
challenges for speech synthesis due to its reliance on nuanced semantic,
contextual, and prosodic cues. While existing speech synthesis research has
focused primarily on broad emotional categories, sarcasm remains largely
unexplored. In this paper, we propose a Large Language Model (LLM)-enhanced
Retrieval-Augmented framework for sarcasm-aware speech synthesis. Our approach
combines (1) semantic embeddings from a LoRA-fine-tuned LLaMA 3, which capture
pragmatic incongruity and discourse-level cues of sarcasm, and (2) prosodic
exemplars retrieved via a Retrieval Augmented Generation (RAG) module, which
provide expressive reference patterns of sarcastic delivery. Integrated within
a VITS backbone, this dual conditioning enables more natural and contextually
appropriate sarcastic speech. Experiments demonstrate that our method
outperforms baselines in both objective measures and subjective evaluations,
yielding improvements in speech naturalness, sarcastic expressivity, and
downstream sarcasm detection.
Ссылки и действия
Дополнительные ресурсы: