EMR-AGENT: Automating Cohort and Feature Extraction from EMR Databases
2510.00549v2
cs.DB, cs.AI, I.2.7; H.2.8
2025-10-04
Authors:
Kwanhyung Lee, Sungsoo Hong, Joonhyung Park, Jeonghyeop Lim, Juhwan Choi, Donghwee Yoon, Eunho Yang
Abstract
Machine learning models for clinical prediction rely on structured data
extracted from Electronic Medical Records (EMRs), yet this process remains
dominated by hardcoded, database-specific pipelines for cohort definition,
feature selection, and code mapping. These manual efforts limit scalability,
reproducibility, and cross-institutional generalization. To address this, we
introduce EMR-AGENT (Automated Generalized Extraction and Navigation Tool), an
agent-based framework that replaces manual rule writing with dynamic,
language-model-driven interaction to extract and standardize structured clinical data.
Our framework automates cohort selection, feature extraction, and code mapping
through interactive querying of databases. Our modular agents iteratively
observe query results and reason over schema and documentation, using SQL not
just for data retrieval but also as a tool for database observation and
decision making. This eliminates the need for hand-crafted, schema-specific
logic. To enable rigorous evaluation, we develop a benchmarking codebase for
three EMR databases (MIMIC-III, eICU, SICdb), including both seen and unseen
schema settings. Our results demonstrate strong performance and generalization
across these databases, highlighting the feasibility of automating a process
previously thought to require expert-driven design. The code will be released
publicly at https://github.com/AITRICS/EMR-AGENT/tree/main. For a
demonstration, please visit our anonymous demo page:
https://anonymoususer-max600.github.io/EMR_AGENT/
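
The idea of using SQL for database observation as well as for data retrieval can be illustrated with a minimal sketch. This is not the EMR-AGENT implementation: the toy in-memory SQLite schema, the table and column names, and the helper functions observe_schema, find_candidate_codes, and extract_feature are all hypothetical, and a simple substring match stands in for the language model's reasoning over schema and documentation.

```python
import sqlite3


def build_toy_db():
    """Create a tiny in-memory database (hypothetical schema, illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE d_items (itemid INTEGER PRIMARY KEY, label TEXT);
        CREATE TABLE chartevents (subject_id INTEGER, itemid INTEGER, valuenum REAL);
        INSERT INTO d_items VALUES
            (1, 'Heart Rate'), (2, 'Resp Rate'), (3, 'Temperature C');
        INSERT INTO chartevents VALUES
            (10, 1, 82.0), (10, 2, 18.0), (11, 1, 95.0), (11, 3, 37.2);
        """
    )
    return conn


def observe_schema(conn):
    """Observation step: use SQL to discover tables and their columns."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    return {
        table: [col[1] for col in conn.execute(f"PRAGMA table_info({table})")]
        for table in tables
    }


def find_candidate_codes(conn, schema, feature):
    """Code-mapping step: query dictionary-like tables for matching labels.

    A substring match is a stand-in (assumption) for model-driven reasoning.
    """
    candidates = []
    for table, columns in schema.items():
        if "itemid" in columns and "label" in columns:
            rows = conn.execute(
                f"SELECT itemid, label FROM {table} WHERE lower(label) LIKE ?",
                (f"%{feature.lower()}%",),
            ).fetchall()
            candidates.extend(rows)
    return candidates


def extract_feature(conn, itemids):
    """Retrieval step: pull values for the item codes chosen above."""
    placeholders = ",".join("?" for _ in itemids)
    query = (
        "SELECT subject_id, valuenum FROM chartevents "
        f"WHERE itemid IN ({placeholders}) ORDER BY subject_id"
    )
    return conn.execute(query, list(itemids)).fetchall()
```

In this sketch the first two functions only inspect the database, and their results decide which codes the final retrieval query uses, so no item identifier is hardcoded in the extraction step.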