Spiralformer: Low Latency Encoder for Streaming Speech Recognition with Circular Layer Skipping and Early Exiting
2510.00982v1
eess.AS, cs.CL, cs.SD
2025-10-04
Authors:
Emiru Tsunoo, Hayato Futami, Yosuke Kashiwagi, Siddhant Arora, Shinji Watanabe
Abstract
For streaming speech recognition, a Transformer-based encoder has been widely
used with block processing. Although many studies have addressed the emission
latency of transducers, little work has explored reducing the encoding
latency of block processing. We seek to reduce latency by frequently
emitting chunks with a small shift rather than infrequently emitting large
chunks, which raises the computational cost. To compute efficiently with the small
chunk shift, we propose a new encoder, Spiralformer, tailored for block
processing by combining layer dropping and early exiting. We skip layer
computation in a cyclic manner and shift the computed layers spirally from
block to block, so that computation for all layers is completed over the
course of block processing. Experimentally, our method achieved a 21.6%
reduction in the average token emission delay on LibriSpeech, and 7.0% on CSJ,
compared with a baseline with similar computational cost and word error
rates.
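
The cyclic skipping described in the abstract can be pictured with a small scheduling sketch. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify how layers are assigned to blocks, so the modulo rule below, the function name `spiral_schedule`, and the parameters `num_layers`, `period`, and `num_blocks` are all hypothetical. Early exiting is not modeled.

```python
def spiral_schedule(num_layers: int, period: int, num_blocks: int) -> list[list[int]]:
    """Return, for each block index, the layer indices computed in that block.

    Assumed rule (hypothetical, not taken from the paper): block b computes
    every layer l with l % period == b % period, so the computed subset
    rotates from block to block and every layer is covered once per
    `period` consecutive blocks.
    """
    schedule = []
    for block in range(num_blocks):
        offset = block % period
        layers = [layer for layer in range(num_layers) if layer % period == offset]
        schedule.append(layers)
    return schedule


# Example: 6 encoder layers, one third of them computed per block.
schedule = spiral_schedule(num_layers=6, period=3, num_blocks=3)
for block, layers in enumerate(schedule):
    print(f"block {block}: layers {layers}")
```

Under this assumed rule each block costs roughly 1/period of a full encoder pass, which is one way a smaller chunk shift could be offset by skipping layers.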