Struc-EMB: The Potential of Structure-Aware Encoding in Language Embeddings
2510.08774v1
cs.LG, cs.AI, cs.CL
2025-10-14
Authors:
Shikun Liu, Haoyu Wang, Mufei Li, Pan Li
Abstract
Text embeddings from Large Language Models (LLMs) have become foundational
for numerous applications. However, these models typically operate on raw text,
overlooking the rich structural information, such as hyperlinks or citations,
that provides crucial context in many real-world datasets. This paper
introduces and systematically evaluates a new paradigm for generating
structure-aware text embeddings by integrating these structural relations
directly into the LLM's internal encoding process, rather than relying on
traditional post-hoc aggregation. We investigate two primary in-process
methods: sequential concatenation and parallel caching. Through extensive
zero-shot experiments across retrieval, clustering, classification, and
recommendation tasks, we demonstrate that our structure-aware approaches
consistently outperform both text-only and post-hoc baselines. Our analysis
reveals critical trade-offs: sequential concatenation excels with noisy,
moderate-length contexts, while parallel caching scales more effectively to
long, high-signal contexts but is more susceptible to distractors. To address
the challenge of noisy structural data, we also introduce and validate two
effective techniques: Context Distillation and Semantic Balancing. This work
provides the first comprehensive analysis of in-process structure-aware
encoding, offering a blueprint for building more powerful and contextually
aware embedding models.
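To make the contrast between post-hoc aggregation and in-process sequential concatenation concrete, here is a minimal sketch. It is not the paper's implementation: `toy_embed` is a hypothetical hashed bag-of-words stand-in for an LLM encoder, and `post_hoc_embed`, `sequential_concat_embed`, `alpha`, and `sep` are names and parameters assumed for illustration only. Because the toy encoder has no attention, the sketch only shows where the neighbor text enters the pipeline, not the quality differences reported above. Parallel caching is not covered here.

```python
import hashlib
import math


def toy_embed(text, dim=8):
    """Hypothetical stand-in for an LLM embedder: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dim
    for token in text.lower().split():
        idx = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def post_hoc_embed(target, neighbors, alpha=0.5):
    """Post-hoc aggregation: encode each text separately, then mix the vectors."""
    t = toy_embed(target)
    if not neighbors:
        return t
    n_vecs = [toy_embed(n) for n in neighbors]
    mean_n = [sum(col) / len(n_vecs) for col in zip(*n_vecs)]
    # alpha is an assumed mixing weight between the target and its neighbors.
    return [alpha * a + (1 - alpha) * b for a, b in zip(t, mean_n)]


def sequential_concat_embed(target, neighbors, sep="\n"):
    """Sequential concatenation: neighbor text and target form one input, encoded once."""
    joint = sep.join(list(neighbors) + [target])
    return toy_embed(joint)


if __name__ == "__main__":
    doc = "graph neural networks for citation data"
    linked = ["message passing on graphs", "citation network benchmarks"]
    print(post_hoc_embed(doc, linked))
    print(sequential_concat_embed(doc, linked))
```

With a real encoder, the second function would let the model attend across the target and its linked documents in a single pass, which is the distinction the abstract draws against mixing independently computed vectors.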