ReasonEmbed: Enhanced Text Embeddings for Reasoning-Intensive Document Retrieval
2510.08252v1
cs.IR, cs.CL
2025-10-11
Authors:
Jianlyu Chen, Junwei Lan, Chaofan Li, Defu Lian, Zheng Liu
Abstract
In this paper, we introduce ReasonEmbed, a novel text embedding model
developed for reasoning-intensive document retrieval. Our work includes three
key technical contributions. First, we propose ReMixer, a new data synthesis
method that overcomes the triviality problem prevalent in previous synthetic
datasets, enabling large-scale production of 82K high-quality training samples.
Second, we design Redapter, a self-adaptive learning algorithm that dynamically
adjusts each training sample's weight based on its reasoning intensity. This
allows the model to effectively capture the complex semantic relationships
between queries and documents. Third, we implement ReasonEmbed across multiple
backbones of varying sizes, all of which achieve superior performance on
reasoning-intensive retrieval tasks. Notably, our ReasonEmbed-Qwen3-8B model
offers a record-high nDCG@10 score of 38.1 on the BRIGHT benchmark, which
significantly outperforms existing text embedding models. We will fully
open-source the resources created for ReasonEmbed to advance research in this
field.
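
For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how nDCG@10 is commonly computed for a single query. It is not taken from the paper or from the BRIGHT evaluation code. The function names and the linear-gain formulation are assumptions made for illustration, and reported scores such as 38.1 are presumably the per-query values averaged and scaled to 0-100.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k relevance labels (linear gain, assumed)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """nDCG@k for one query.

    ranked_relevances: relevance labels of the retrieved documents, in ranked order.
    all_relevances: relevance labels of every judged document for the query,
                    used to build the ideal ranking.
    """
    ideal_dcg = dcg_at_k(sorted(all_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents: define the score as zero
    return dcg_at_k(ranked_relevances, k) / ideal_dcg

# Hypothetical example: two relevant documents, retrieved at ranks 1 and 3.
score = ndcg_at_k([1, 0, 1], [1, 1, 0])
print(round(score, 4))
```

A benchmark-level figure would then be the mean of `ndcg_at_k` over all queries, which is the sense in which a single number can summarize retrieval quality across a task suite.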