The Effect of Attention Head Count on Transformer Approximation
2510.06662v1
cs.LG, stat.ML
2025-10-12
Авторы:
Penghao Yu, Haotian Jiang, Zeyu Bao, Ruoxi Yu, Qianxiao Li
Abstract
Transformer has become the dominant architecture for sequence modeling, yet a
detailed understanding of how its structural parameters influence expressive
power remains limited. In this work, we study the approximation properties of
transformers, with particular emphasis on the role of the number of attention
heads. Our analysis begins with the introduction of a generalized $D$-retrieval
task, which we prove to be dense in the space of continuous functions, thereby
providing the basis for our theoretical framework. We then establish both upper
and lower bounds on the parameter complexity required for
$\epsilon$-approximation. Specifically, we show that transformers with
sufficiently many heads admit efficient approximation, whereas with too few
heads, the number of parameters must scale at least as $O(1/\epsilon^{cT})$,
for some constant $c$ and sequence length $T$. To the best of our knowledge,
this constitutes the first rigorous lower bound of this type in a nonlinear and
practically relevant setting. We further examine the single-head case and
demonstrate that an embedding dimension of order $O(T)$ allows complete
memorization of the input, where approximation is entirely achieved by the
feed-forward block. Finally, we validate our theoretical findings with
experiments on both synthetic data and real-world tasks, illustrating the
practical relevance of our results.
Ссылки и действия
Дополнительные ресурсы: