Publication Trend Analysis and Synthesis via Large Language Model: A Case Study of Engineering in PNAS
2510.16152v1
cs.DL, cs.AI, cs.CL, cs.LG
2025-10-22
Authors:
Mason Smetana, Lev Khazanovich
Abstract
Scientific literature is increasingly siloed by complex language, static
disciplinary structures, and potentially sparse keyword systems, making it
cumbersome to capture the dynamic nature of modern science. This study
addresses these challenges by introducing an adaptable large language model
(LLM)-driven framework to quantify thematic trends and map the evolving
landscape of scientific knowledge. The approach is demonstrated over a 20-year
collection of more than 1,500 engineering articles published in the Proceedings
of the National Academy of Sciences (PNAS), noted for their breadth and depth
of research focus. A two-stage classification pipeline first establishes a
primary thematic category for each article based on its abstract. The
subsequent phase performs a full-text analysis to assign secondary
classifications, revealing latent, cross-topic connections across the corpus.
Traditional natural language processing (NLP) methods, such as Bag-of-Words
(BoW) and Term Frequency-Inverse Document Frequency (TF-IDF), confirm the
resulting topical structure and also suggest that standalone word-frequency
analyses may be insufficient for mapping fields with high diversity. Finally, a
disjoint graph representation between the primary and secondary classifications
reveals implicit connections between themes that may be less apparent when
analyzing abstracts or keywords alone. The findings show that the approach
independently recovers much of the journal's editorially embedded structure
without prior knowledge of its existing dual-classification schema (e.g.,
biological studies also classified as engineering). This framework offers a
powerful tool for detecting potential thematic trends and providing a
high-level overview of scientific progress.
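
For readers unfamiliar with the word-frequency baseline named in the abstract, the following is a minimal sketch of TF-IDF weighting using only the Python standard library. It is not the authors' implementation: the function name `tf_idf`, the whitespace tokenizer, the unsmoothed `log(N / df)` form of the inverse document frequency, and the three toy "abstracts" are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document (hypothetical helper).

    Term frequency is the count divided by document length; inverse
    document frequency is log(N / df), with no smoothing applied.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy stand-ins for article abstracts (invented, not from the PNAS corpus).
toy_abstracts = [
    "graphene membrane water filtration",
    "neural network protein folding",
    "water desalination membrane energy",
]
scores = tf_idf(toy_abstracts)
```

In this toy corpus, a term confined to a single document ("graphene") receives a higher weight than one shared across documents ("water"), which is the property that lets TF-IDF surface topic-specific vocabulary. A term appearing in every document would receive a weight of zero under this unsmoothed form.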