SliceMoE: Routing Embedding Slices Instead of Tokens for Fine-Grained and Balanced Transformer Scaling
2510.04286v1
cs.CL, cs.AI, cs.LG
2025-10-08
Авторы:
Harshil Vejendla
Abstract
Mixture-of-Experts (MoE) layers scale transformers by routing tokens to a
sparse subset of feed-forward experts. Token-level routing, however, assigns an
entire semantic spectrum to each expert, creating capacity bottlenecks,
load-balancing pathologies, and limited specialization. We introduce SliceMoE,
an architecture that routes contiguous slices of a token's hidden vector. A
d-dimensional embedding is partitioned into S slices, and for each slice, a
lightweight shared router predicts the top-k experts. Experts operate on their
assigned slices independently, and outputs are reassembled, maintaining
per-token FLOP efficiency. Because slices from different tokens interleave
within an expert, utilization is naturally smoother. We propose a slice-level
capacity loss, cross-slice dropout, and efficient fused batched GEMM kernels.
Experiments on WikiText-103 language modeling, WMT En-De translation, and three
text-classification datasets show SliceMoE attains up to 1.7x faster inference
than dense baselines, 12 to 18 percent lower perplexity than parameter-matched
token-MoE, and improved expert balance, with interpretable expertise over
syntactic versus semantic subspaces.
Ссылки и действия
Дополнительные ресурсы: