NEZHA: A Zero-sacrifice and Hyperspeed Decoding Architecture for Generative Recommendations

2511.18793v1 cs.AI, cs.LG 2025-11-26

Authors:

Yejing Wang, Shengyu Zhou, Jinyu Lu, Ziwei Liu, Langming Liu, Maolin Wang, Wenlin Zhang, Feng Li, Wenbo Su, Pengjie Wang, Jian Xu, Xiangyu Zhao

Abstract

Generative Recommendation (GR), powered by Large Language Models (LLMs), represents a promising new paradigm for industrial recommender systems. However, the practical application of GR systems is severely hindered by high inference latency, which makes them infeasible for high-throughput, real-time services and limits their overall business impact. While Speculative Decoding (SD) has been proposed to accelerate the autoregressive generation process, existing implementations introduce new bottlenecks: they typically rely on separate draft models and model-based verifiers, which demand additional training and add latency overhead. In this paper, we address these challenges with NEZHA, a novel architecture that achieves hyperspeed decoding for GR systems without sacrificing recommendation quality. Specifically, NEZHA integrates a nimble autoregressive draft head directly into the primary model, enabling efficient self-drafting. This design, combined with a specialized input prompt structure, preserves the integrity of sequence-to-sequence generation. Furthermore, to tackle the critical problem of hallucination, a major source of performance degradation, we introduce an efficient, model-free verifier based on a hash set. We demonstrate the effectiveness of NEZHA through extensive experiments on public datasets. The system has been deployed on Taobao since October 2025, where it drives billion-level advertising revenue and serves hundreds of millions of daily active users.
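
The abstract does not describe how the hash-set verifier is implemented. As a rough illustration of the general idea only, the sketch below assumes that each item is identified by a short tuple of integer tokens and that a drafted candidate is accepted only if its full tuple appears in a precomputed set of valid items. The names `build_valid_set` and `verify_drafts`, the tuple representation, and the toy catalog are all hypothetical and are not taken from the paper.

```python
from typing import Iterable, List, Sequence, Set, Tuple

# Assumption: an item ID is a fixed-length tuple of integer tokens.
ItemID = Tuple[int, ...]


def build_valid_set(catalog: Iterable[Sequence[int]]) -> Set[ItemID]:
    """Precompute a hash set of all token tuples that map to real items."""
    return {tuple(tokens) for tokens in catalog}


def verify_drafts(drafts: Iterable[Sequence[int]], valid: Set[ItemID]) -> List[ItemID]:
    """Keep only drafted candidates that correspond to an existing item.

    Each membership check is an average O(1) hash lookup, so no second
    model is needed to reject hallucinated (non-existent) items.
    """
    accepted: List[ItemID] = []
    for draft in drafts:
        key = tuple(draft)
        if key in valid:
            accepted.append(key)
    return accepted


# Hypothetical toy catalog and drafted candidates.
catalog = [(3, 17, 42), (3, 17, 8), (5, 1, 9)]
valid_ids = build_valid_set(catalog)

drafts = [(3, 17, 42), (3, 99, 1), (5, 1, 9)]
accepted = verify_drafts(drafts, valid_ids)
print(accepted)
```

In this toy run the second draft is rejected because `(3, 99, 1)` is not in the catalog. A lookup of this kind trades memory for speed, since the set of valid IDs has to be held in memory; whether and how NEZHA does this is not stated in the abstract.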

