Making Machines Sound Sarcastic: LLM-Enhanced and Retrieval-Guided Sarcastic Speech Synthesis

2510.07096v1 cs.CL, cs.SD, eess.AS 2025-10-10

Авторы:

Zhu Li, Yuqing Zhang, Xiyuan Gao, Shekhar Nayak, Matt Coler

Abstract

Sarcasm is a subtle form of non-literal language that poses significant challenges for speech synthesis due to its reliance on nuanced semantic, contextual, and prosodic cues. While existing speech synthesis research has focused primarily on broad emotional categories, sarcasm remains largely unexplored. In this paper, we propose a Large Language Model (LLM)-enhanced Retrieval-Augmented framework for sarcasm-aware speech synthesis. Our approach combines (1) semantic embeddings from a LoRA-fine-tuned LLaMA 3, which capture pragmatic incongruity and discourse-level cues of sarcasm, and (2) prosodic exemplars retrieved via a Retrieval Augmented Generation (RAG) module, which provide expressive reference patterns of sarcastic delivery. Integrated within a VITS backbone, this dual conditioning enables more natural and contextually appropriate sarcastic speech. Experiments demonstrate that our method outperforms baselines in both objective measures and subjective evaluations, yielding improvements in speech naturalness, sarcastic expressivity, and downstream sarcasm detection.

Ссылки и действия

Читать на arXiv Скачать PDF

Дополнительные ресурсы:

Making Machines Sound Sarcastic: LLM-Enhanced and Retrieval-Guided Sarcastic Speech Synthesis

Авторы:

Abstract

Ссылки и действия

Связанные статьи

Joint Speech and Text Training for LLM-Based End-to-End Spoken Dialogue State Tr...

Mispronunciation Detection and Diagnosis Without Model Training: A Retrieval-Bas...

Proactive Hearing Assistants that Isolate Egocentric Conversations

Hallucination Benchmark for Speech Foundation Models

MTP-S2UT: Enhancing Speech-to-Speech Translation Quality with Multi-token Predic...

Навигация