FlexiDataGen: An Adaptive LLM Framework for Dynamic Semantic Dataset Generation in Sensitive Domains

arXiv:2510.19025v1 | cs.DB, cs.AI | 2025-10-24

Authors:

Hamed Jelodar, Samita Bai, Roozbeh Razavi-Far, Ali A. Ghorbani

Abstract

Dataset availability and quality remain critical challenges in machine learning, especially in domains where data are scarce, expensive to acquire, or constrained by privacy regulations. Fields such as healthcare, biomedical research, and cybersecurity frequently face high data acquisition costs, limited access to annotated data, and the rarity or sensitivity of key events. These issues, collectively referred to as the dataset challenge, hinder the development of accurate and generalizable machine learning models in such high-stakes domains. To address this, we introduce FlexiDataGen, an adaptive large language model (LLM) framework designed for dynamic semantic dataset generation in sensitive domains. FlexiDataGen autonomously synthesizes rich, semantically coherent, and linguistically diverse datasets tailored to specialized fields. The framework integrates four core components: (1) syntactic-semantic analysis, (2) retrieval-augmented generation, (3) dynamic element injection, and (4) iterative paraphrasing with semantic validation. Together, these components are intended to keep the generated data high-quality and domain-relevant. Experimental results show that FlexiDataGen alleviates data shortages and annotation bottlenecks, enabling scalable and accurate machine learning model development.
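
The abstract names the fourth component, iterative paraphrasing with semantic validation, without giving implementation details. The following is only a minimal sketch of what such a loop could look like, not the authors' method. Every name in it (`generate_variants`, `toy_paraphrase`, `jaccard`, the `SYNONYMS` table, the 0.5 threshold) is a hypothetical stand-in: a real system would presumably call an LLM to paraphrase and use an embedding-based similarity score rather than token overlap.

```python
from typing import Callable, Dict, List


def jaccard(a: str, b: str) -> float:
    """Toy similarity: overlap of lowercase token sets.

    Assumption: stands in for an embedding-based semantic score.
    """
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


# Hypothetical substitution table; a real paraphraser would be an LLM call.
SYNONYMS: Dict[str, str] = {
    "patient": "subject",
    "reported": "described",
    "severe": "intense",
}


def toy_paraphrase(text: str) -> List[str]:
    """Return one variant per replaceable word (single substitution each)."""
    words = text.split()
    variants = []
    for i, word in enumerate(words):
        if word in SYNONYMS:
            variants.append(" ".join(words[:i] + [SYNONYMS[word]] + words[i + 1:]))
    return variants


def generate_variants(
    seed: str,
    paraphrase: Callable[[str], List[str]],
    similarity: Callable[[str, str], float] = jaccard,
    threshold: float = 0.5,
    rounds: int = 3,
) -> List[str]:
    """Paraphrase iteratively, keeping only candidates close to the seed."""
    accepted: List[str] = []
    frontier = [seed]
    seen = {seed}
    for _ in range(rounds):
        next_frontier = []
        for text in frontier:
            for candidate in paraphrase(text):
                if candidate in seen:
                    continue  # skip duplicates produced in earlier rounds
                seen.add(candidate)
                # Validation step: reject candidates that drift from the seed.
                if similarity(seed, candidate) >= threshold:
                    accepted.append(candidate)
                    next_frontier.append(candidate)
        if not next_frontier:
            break  # nothing new passed validation, so stop early
        frontier = next_frontier
    return accepted


if __name__ == "__main__":
    for variant in generate_variants("patient reported severe headache", toy_paraphrase):
        print(variant)
```

Validating each candidate against the original seed, rather than against its immediate parent, is one way to keep meaning from drifting across rounds; whether FlexiDataGen does this is not stated in the abstract.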

