Atom-anchored LLMs speak Chemistry: A Retrosynthesis Demonstration

arXiv:2510.16590v1 [cs.LG, cs.AI, q-bio.BM], 2025-10-22

Authors:

Alan Kai Hassen, Andrius Bernatavicius, Antonius P. A. Janssen, Mike Preuss, Gerard J. P. van Westen, Djork-Arné Clevert

Abstract

Applications of machine learning in chemistry are often limited by the scarcity and expense of labeled data, restricting traditional supervised methods. In this work, we introduce a framework for molecular reasoning using general-purpose Large Language Models (LLMs) that operates without requiring labeled training data. Our method anchors chain-of-thought reasoning to the molecular structure by using unique atomic identifiers. First, the LLM performs a one-shot task to identify relevant fragments and their associated chemical labels or transformation classes. In an optional second step, this position-aware information is used in a few-shot task with provided class examples to predict the chemical transformation. We apply our framework to single-step retrosynthesis, a task where LLMs have previously underperformed. Across academic benchmarks and expert-validated drug discovery molecules, our work enables LLMs to achieve high success rates in identifying chemically plausible reaction sites ($\geq90\%$), named reaction classes ($\geq40\%$), and final reactants ($\geq74\%$). Beyond solving complex chemical tasks, our work also provides a method to generate theoretically grounded synthetic datasets by mapping chemical knowledge onto the molecular structure, thereby addressing data scarcity.
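To make the idea of anchoring reasoning to atomic identifiers concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hand-written toy molecule (the heavy atoms of acetanilide), a hypothetical prompt wording, and a mocked model reply; the helper names `anchored_atom_tokens`, `build_prompt`, and `parse_cited_atoms` are invented for illustration. It shows only how unique identifiers let a reply be checked against the structure.

```python
import re

# Assumption: a toy molecule written by hand (acetanilide, CC(=O)Nc1ccccc1).
# Atoms are listed in SMILES order and numbered from 1.
ATOMS = ["C", "C", "O", "N", "c", "c", "c", "c", "c", "c"]


def anchored_atom_tokens(atoms):
    """Tag each atom symbol with a unique identifier, e.g. '[N:4]'."""
    return [f"[{sym}:{idx}]" for idx, sym in enumerate(atoms, start=1)]


def build_prompt(atoms):
    """Assemble a (hypothetical) prompt that asks for atoms cited by identifier."""
    listing = " ".join(anchored_atom_tokens(atoms))
    return (
        "Atoms with identifiers: " + listing + "\n"
        "List candidate disconnection sites, citing atoms by their "
        "identifiers in the form 'atoms: 2, 4'."
    )


def parse_cited_atoms(response, n_atoms):
    """Extract cited identifiers and reject any that are not in the molecule."""
    match = re.search(r"atoms:\s*([\d,\s]+)", response)
    if match is None:
        return []
    ids = [int(tok) for tok in re.findall(r"\d+", match.group(1))]
    invalid = [i for i in ids if not 1 <= i <= n_atoms]
    if invalid:
        raise ValueError(f"response cites unknown atom identifiers: {invalid}")
    return ids


prompt = build_prompt(ATOMS)
# Stand-in for a model reply; no LLM is called in this sketch.
mock_response = "Amide C-N bond; atoms: 2, 4"
cited = parse_cited_atoms(mock_response, len(ATOMS))
```

Because every atom carries a unique identifier, a reply that cites an atom outside the molecule can be rejected mechanically before any chemistry is evaluated.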

Related articles

Interpreting GFlowNets for Drug Discovery: Extracting Actionable Insights for Me...

STAR-VAE: Latent Variable Transformers for Scalable and Controllable Molecular G...

Protein as a Second Language for LLMs

From Supervision to Exploration: What Does Protein Language Model Learn During R...

A Foundation Chemical Language Model for Comprehensive Fragment-Based Drug Disco...
