Protein as a Second Language for LLMs

arXiv:2510.11188v1 | cs.LG, cs.AI, q-bio.BM | 2025-10-15

Authors:

Xinhui Chen, Zuchao Li, Mengqi Gao, Yufeng Zhang, Chak Tou Leong, Haoyang Li, Jiaqi Chen

Abstract

Deciphering the function of unseen protein sequences is a fundamental challenge with broad scientific impact, yet most existing methods depend on task-specific adapters or large-scale supervised fine-tuning. We introduce the "Protein-as-Second-Language" framework, which reformulates amino-acid sequences as sentences in a novel symbolic language that large language models can interpret through contextual exemplars. Our approach adaptively constructs sequence-question-answer triples that reveal functional cues in a zero-shot setting, without any further training. To support this process, we curate a bilingual corpus of 79,926 protein-QA instances spanning attribute prediction, descriptive understanding, and extended reasoning. Empirically, our method delivers consistent gains across diverse open-source LLMs and GPT-4, achieving up to 17.2% ROUGE-L improvement (average +7%) and even surpassing fine-tuned protein-specific language models. These results highlight that generic LLMs, when guided with protein-as-language cues, can outperform domain-specialized models, offering a scalable pathway for protein understanding in foundation models.
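The exemplar-based prompting described in the abstract can be illustrated with a minimal sketch. The abstract does not say how exemplars are chosen or how the prompt is laid out, so everything here is an assumption made for illustration: the names `Exemplar`, `select_exemplars`, and `build_prompt` are hypothetical, k-mer Jaccard similarity stands in for the paper's adaptive selection, and the "Sequence / Question / Answer" layout is only one plausible template.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Exemplar:
    """One sequence-question-answer triple (hypothetical container)."""
    sequence: str
    question: str
    answer: str


def kmers(sequence: str, k: int = 3) -> set[str]:
    """Return the set of overlapping k-mers in an amino-acid sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def select_exemplars(query: str, pool: list[Exemplar],
                     n: int = 2, k: int = 3) -> list[Exemplar]:
    """Pick the n pool entries most similar to the query sequence.

    ASSUMPTION: k-mer Jaccard similarity is a stand-in; the source only
    says triples are constructed "adaptively".
    """
    query_kmers = kmers(query, k)
    ranked = sorted(
        pool,
        key=lambda ex: jaccard(query_kmers, kmers(ex.sequence, k)),
        reverse=True,
    )
    return ranked[:n]


def build_prompt(query_sequence: str, question: str,
                 exemplars: list[Exemplar]) -> str:
    """Format exemplars plus the unanswered query as one in-context prompt."""
    blocks = [
        f"Sequence: {ex.sequence}\nQuestion: {ex.question}\nAnswer: {ex.answer}"
        for ex in exemplars
    ]
    # The final block leaves the answer open for the LLM to complete.
    blocks.append(f"Sequence: {query_sequence}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(blocks)
```

Calling `select_exemplars(query, pool)` and passing the result to `build_prompt` yields a single text prompt that can be sent to any general-purpose LLM without further training, which is the zero-shot setting the abstract describes.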

Related articles

Interpreting GFlowNets for Drug Discovery: Extracting Actionable Insights for Me...

STAR-VAE: Latent Variable Transformers for Scalable and Controllable Molecular G...

Atom-anchored LLMs speak Chemistry: A Retrosynthesis Demonstration

From Supervision to Exploration: What Does Protein Language Model Learn During R...

A Foundation Chemical Language Model for Comprehensive Fragment-Based Drug Disco...
