Beyond Tokens in Language Models: Interpreting Activations through Text Genre Chunks

2511.16540v1 cs.CL, cs.LG 2025-11-21

Авторы:

Éloïse Benito-Rodriguez, Einar Urdshals, Jasmina Nasufi, Nicky Pochinkov

Abstract

Understanding Large Language Models (LLMs) is key to ensure their safe and beneficial deployment. This task is complicated by the difficulty of interpretability of LLM structures, and the inability to have all their outputs human-evaluated. In this paper, we present the first step towards a predictive framework, where the genre of a text used to prompt an LLM, is predicted based on its activations. Using Mistral-7B and two datasets, we show that genre can be extracted with F1-scores of up to 98% and 71% using scikit-learn classifiers. Across both datasets, results consistently outperform the control task, providing a proof of concept that text genres can be inferred from LLMs with shallow learning models.

Ссылки и действия

Читать на arXiv Скачать PDF

Дополнительные ресурсы:

Beyond Tokens in Language Models: Interpreting Activations through Text Genre Chunks

Авторы:

Abstract

Ссылки и действия

Связанные статьи

A Preliminary Study on the Promises and Challenges of Native Top-$k$ Sparse Atte...

Computational Linguistics Meets Libyan Dialect: A Study on Dialect Identificatio...

Sarcasm Detection on Reddit Using Classical Machine Learning and Feature Enginee...

Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling

Enhancing Job Matching: Occupation, Skill and Qualification Linking with the ESC...

Навигация