Patient-specific Biomolecular Instruction Tuning
2509.22853v1
q-bio.QM, cs.AI, cs.CL, cs.LG, 92C40, 68T07, 62P10, I.2.7; I.5.1; J.3
2025-10-01
Authors:
Irsyad Adam, Zekai Chen, David Laub, Shaun Porwal, Arda Pekis, Kevin Brown
Abstract
Proteomics data is essential to understanding the pathogenesis of a disease
phenotype. In cancer, analysis of molecular signatures enables precision
medicine through the identification of biological processes that drive
individualized tumor progression, therapeutic resistance, and clinical
heterogeneity. Recent advances in multimodal large language models (LLMs) have
shown remarkable capacity to integrate and reason across heterogeneous data
modalities. However, performing multimodal language modeling for molecular
understanding of patient-specific proteomics remains a significant challenge
due to two barriers: (1) the lack of instruction-tuning datasets that enable
clinical interpretation from proteomics data, and (2) the absence of language
modeling architectures designed to capture the rich heterogeneity of molecular
data. In this work, we introduce CPTAC-PROTSTRUCT, the first instruction-tuning
dataset for molecular understanding of oncology, comprising over 400k
open-ended examples derived from individualized proteomic profiles curated from
the largest national proteomics cancer study (CPTAC). Additionally, we propose
KRONOS (Knowledge Representation of patient Omics Networks in Oncology via
Structured tuning), a novel graph-LLM framework that leverages molecular
interaction topology with proteomics to learn patient-specific graph
representations for enhanced clinical reasoning. We show that KRONOS achieves
competitive performance across benchmark clinical tasks, including molecular
classification, temporal trajectory modeling, and tumor stage prediction from
proteomics data. Ultimately, this approach empowers LLMs to understand
patient-level pathogenesis, advancing precision medicine through more accurate
diagnosis, prognosis, and treatment stratification.
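
The idea of combining a shared molecular interaction topology with one patient's proteomic profile can be illustrated with a toy sketch. This is not the KRONOS architecture, which the abstract does not specify. It is a minimal example under stated assumptions: the edge list, the abundance values, and the function names `patient_graph`, `propagate`, and `readout` are all hypothetical, and a single round of mean aggregation stands in for whatever graph encoder the authors actually use.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical shared interaction topology (undirected protein-protein edges).
# Illustrative only; not taken from a curated interaction database.
EDGES: List[Tuple[str, str]] = [
    ("TP53", "MDM2"),
    ("MDM2", "AKT1"),
    ("AKT1", "EGFR"),
]


def patient_graph(
    abundance: Dict[str, float], edges: List[Tuple[str, str]]
) -> Dict[str, Set[str]]:
    """Restrict the shared topology to proteins measured in this patient."""
    adj: Dict[str, Set[str]] = {protein: set() for protein in abundance}
    for u, v in edges:
        # Keep an edge only if both endpoints were quantified for the patient.
        if u in abundance and v in abundance:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def propagate(
    abundance: Dict[str, float], adj: Dict[str, Set[str]]
) -> Dict[str, float]:
    """One round of mean aggregation over each node and its neighbors."""
    updated: Dict[str, float] = {}
    for node, neighbors in adj.items():
        values = [abundance[node]] + [abundance[n] for n in sorted(neighbors)]
        updated[node] = sum(values) / len(values)
    return updated


def readout(node_values: Dict[str, float]) -> float:
    """Mean-pool node values into a single patient-level scalar."""
    return sum(node_values.values()) / len(node_values)


# Hypothetical patient profile: EGFR was not measured, so its edge is dropped.
patient = {"TP53": 1.0, "MDM2": 3.0, "AKT1": 5.0}
graph = patient_graph(patient, EDGES)
smoothed = propagate(patient, graph)
summary = readout(smoothed)
```

In a real graph-LLM pipeline the scalar node values would be learned feature vectors, and the pooled representation would be projected into the language model's embedding space rather than read as a single number; those steps are omitted here.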