DACP: Domain-Adaptive Continual Pre-Training of Large Language Models for Phone Conversation Summarization
2510.05858v3
cs.CL, cs.AI, cs.LG
2025-10-10
Authors:
Xue-Yong Fu, Elena Khasanova, Md Tahmid Rahman Laskar, Harsh Saini, Shashi Bhushan TN
Abstract
Large language models (LLMs) have achieved impressive performance in text
summarization, yet they often fall short when applied to
specialized domains that differ from their original pre-training distribution.
While fine-tuning can improve summarization quality, it typically relies on
costly and scarce high-quality labeled data. In this work, we explore continual
pre-training as a scalable, self-supervised approach to adapt LLMs for
downstream summarization tasks, particularly in the context of noisy real-world
conversation transcripts. We conduct extensive experiments using large-scale,
unlabeled business conversation data to investigate whether continual
pre-training enhances model capabilities in conversational summarization. Our
results demonstrate that continual pre-training yields substantial gains in
both in-domain and out-of-domain summarization benchmarks, while maintaining
strong generalization and robustness. We also analyze the effects of data
selection strategies, providing practical guidelines for applying continual
pre-training in summarization-focused industrial applications.