Correcting Mean Bias in Text Embeddings: A Refined Renormalization with Training-Free Improvements on MMTEB

2511.11041v1 cs.CL, cs.AI, cs.LG 2025-11-17

Авторы:

Xingyu Ren, Youran Sun, Haoyu Liang

Abstract

We find that current text embedding models produce outputs with a consistent bias, i.e., each embedding vector $e$ can be decomposed as $\tilde{e} + μ$, where $μ$ is almost identical across all sentences. We propose a plug-and-play, training-free and lightweight solution called Renormalization. Through extensive experiments, we show that renormalization consistently and statistically significantly improves the performance of existing models on the Massive Multilingual Text Embedding Benchmark (MMTEB). In particular, across 38 models, renormalization improves performance by 9.7 $σ$ on retrieval tasks, 3.1 $σ$ on classification tasks, and 0.8 $σ$ on other types of tasks. Renormalization has two variants: directly subtracting $μ$ from $e$, or subtracting the projection of $e$ onto $μ$. We theoretically predict that the latter performs better, and our experiments confirm this prediction.

Ссылки и действия

Читать на arXiv Скачать PDF

Дополнительные ресурсы:

Correcting Mean Bias in Text Embeddings: A Refined Renormalization with Training-Free Improvements on MMTEB

Авторы:

Abstract

Ссылки и действия

Связанные статьи

Arbitrage: Efficient Reasoning via Advantage-Aware Speculation

Structured Document Translation via Format Reinforcement Learning

Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective

Agreement-Constrained Probabilistic Minimum Bayes Risk Decoding

SUPERChem: A Multimodal Reasoning Benchmark in Chemistry

Навигация