The Data-Quality Illusion: Rethinking Classifier-Based Quality Filtering for LLM Pretraining

2510.00866v2 cs.LG, cs.CL 2025-10-04

Авторы:

Thiziri Nait Saada, Louis Bethune, Michal Klein, David Grangier, Marco Cuturi, Pierre Ablin

Abstract

Large-scale models are pretrained on massive web-crawled datasets containing documents of mixed quality, making data filtering essential. A popular method is Classifier-based Quality Filtering (CQF), which trains a binary classifier to distinguish between pretraining data and a small, high-quality set. It assigns each pretraining document a quality score defined as the classifier's score and retains only the top-scoring ones. We provide an in-depth analysis of CQF. We show that while CQF improves downstream task performance, it does not necessarily enhance language modeling on the high-quality dataset. We explain this paradox by the fact that CQF implicitly filters the high-quality dataset as well. We further compare the behavior of models trained with CQF to those trained on synthetic data of increasing quality, obtained via random token permutations, and find starkly different trends. Our results challenge the view that CQF captures a meaningful notion of data quality.

Ссылки и действия

Читать на arXiv Скачать PDF

Дополнительные ресурсы:

The Data-Quality Illusion: Rethinking Classifier-Based Quality Filtering for LLM Pretraining

Авторы:

Abstract

Ссылки и действия

Связанные статьи

Natural Language Actor-Critic: Scalable Off-Policy Learning in Language Space

Towards Active Synthetic Data Generation for Finetuning Language Models

AlignSAE: Concept-Aligned Sparse Autoencoders

Measuring What LLMs Think They Do: SHAP Faithfulness and Deployability on Financ...

BanglaSentNet: An Explainable Hybrid Deep Learning Framework for Multi-Aspect Se...

Навигация