FlexiDataGen: An Adaptive LLM Framework for Dynamic Semantic Dataset Generation in Sensitive Domains
2510.19025v1
cs.DB, cs.AI
2025-10-24
Authors:
Hamed Jelodar, Samita Bai, Roozbeh Razavi-Far, Ali A. Ghorbani
Abstract
Dataset availability and quality remain critical challenges in machine
learning, especially in domains where data are scarce, expensive to acquire, or
constrained by privacy regulations. Fields such as healthcare, biomedical
research, and cybersecurity frequently encounter high data acquisition costs,
limited access to annotated data, and the rarity or sensitivity of key events.
These issues, collectively referred to as the dataset challenge, hinder the
development of accurate and generalizable machine learning models in such
high-stakes domains. To address this, we introduce FlexiDataGen, an adaptive
large language model (LLM) framework designed for dynamic semantic dataset
generation in sensitive domains. FlexiDataGen autonomously synthesizes rich,
semantically coherent, and linguistically diverse datasets tailored to
specialized fields. The framework integrates four core components: (1)
syntactic-semantic analysis, (2) retrieval-augmented generation, (3) dynamic
element injection, and (4) iterative paraphrasing with semantic validation.
Together, these components ensure the generation of high-quality,
domain-relevant data. Experimental results show that FlexiDataGen effectively
alleviates data shortages and annotation bottlenecks, enabling scalable and
accurate machine learning model development.
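
As an illustration only, the four components named in the abstract can be read as a sequential pipeline. The Python sketch below is a toy stand-in and not the authors' implementation. Every function name, the keyword-overlap retrieval, the template slots, the Jaccard similarity check, and the 0.5 threshold are assumptions introduced here for clarity. In an LLM-based framework, each stand-in would be replaced by a model-backed component.

```python
import re
from typing import Callable, Dict, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase word tokens, used by every toy stage below."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a crude proxy for semantic similarity."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def analyze(seed: str) -> List[str]:
    """(1) Stand-in for syntactic-semantic analysis: keep longer tokens as keywords."""
    return sorted(t for t in tokenize(seed) if len(t) > 3)


def retrieve(keywords: List[str], corpus: List[str], k: int = 2) -> List[str]:
    """(2) Stand-in for retrieval: rank documents by keyword overlap."""
    wanted = set(keywords)
    ranked = sorted(corpus, key=lambda doc: len(tokenize(doc) & wanted), reverse=True)
    return ranked[:k]


def inject(template: str, slots: Dict[str, str]) -> str:
    """(3) Stand-in for dynamic element injection: fill named placeholders."""
    return template.format(**slots)


def paraphrase_with_validation(
    text: str,
    paraphraser: Callable[[str], str],
    threshold: float = 0.5,  # assumed value, not taken from the paper
    max_rounds: int = 3,
) -> str:
    """(4) Iteratively paraphrase; keep a candidate only if it stays close to the original."""
    current = text
    for _ in range(max_rounds):
        candidate = paraphraser(current)
        if jaccard(candidate, text) >= threshold:
            current = candidate  # validated: accept this paraphrase
        # otherwise the candidate is discarded and the last valid text is kept
    return current


if __name__ == "__main__":
    corpus = [
        "Phishing email targets hospital staff credentials",
        "Weather report for Tuesday",
        "Ransomware encrypts hospital records",
    ]
    keywords = analyze("hospital phishing incident")
    context = retrieve(keywords, corpus, k=1)
    sample = inject(
        "A {attack} message reached the {target} team.",
        {"attack": "phishing", "target": "billing"},
    )
    final = paraphrase_with_validation(sample, lambda s: s.replace("reached", "hit"))
    print(context, final)
```

The validation step is the part that matters for the design: a paraphrase that drifts too far from the source text is rejected, so repeated paraphrasing adds surface diversity without changing the meaning of the sample.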