Towards a Method for Synthetic Generation of Persons with Aphasia Transcripts
2510.24817v2
cs.CL, cs.AI, cs.LG
2025-10-31
Авторы:
Jason M. Pittman, Anton Phillips Jr., Yesenia Medina-Santos, Brielle C. Stark
Abstract
In aphasia research, Speech-Language Pathologists (SLPs) devote extensive
time to manually coding speech samples using Correct Information Units (CIUs),
a measure of how informative an individual sample of speech is. Developing
automated systems to recognize aphasic language is limited by data scarcity.
For example, only about 600 transcripts are available in AphasiaBank yet
billions of tokens are used to train large language models (LLMs). In the
broader field of machine learning (ML), researchers increasingly turn to
synthetic data when such are sparse. Therefore, this study constructs and
validates two methods to generate synthetic transcripts of the AphasiaBank Cat
Rescue picture description task. One method leverages a procedural programming
approach while the second uses Mistral 7b Instruct and Llama 3.1 8b Instruct
LLMs. The methods generate transcripts across four severity levels (Mild,
Moderate, Severe, Very Severe) through word dropping, filler insertion, and
paraphasia substitution. Overall, we found, compared to human-elicited
transcripts, Mistral 7b Instruct best captures key aspects of linguistic
degradation observed in aphasia, showing realistic directional changes in NDW,
word count, and word length amongst the synthetic generation methods. Based on
the results, future work should plan to create a larger dataset, fine-tune
models for better aphasic representation, and have SLPs assess the realism and
usefulness of the synthetic transcripts.
Ссылки и действия
Дополнительные ресурсы: