Game-Time: Evaluating Temporal Dynamics in Spoken Language Models

2509.26388v1 eess.AS, cs.AI, cs.CL 2025-10-02

Authors:

Kai-Wei Chang, En-Pei Hu, Chun-Yi Kuan, Wenze Ren, Wei-Chih Chen, Guan-Ting Lin, Yu Tsao, Shao-Hua Sun, Hung-yi Lee, James Glass

Abstract

Conversational Spoken Language Models (SLMs) are emerging as a promising paradigm for real-time speech interaction. However, their capacity to handle temporal dynamics, including the ability to manage timing, tempo, and simultaneous speaking, remains a critical and unevaluated challenge for conversational fluency. To address this gap, we introduce the Game-Time Benchmark, a framework to systematically assess these temporal capabilities. Inspired by how humans learn a language through language activities, Game-Time consists of basic instruction-following tasks and advanced tasks with temporal constraints, such as tempo adherence and synchronized responses. Our evaluation of diverse SLM architectures reveals a clear performance disparity: while state-of-the-art models handle basic tasks well, many contemporary systems still struggle with fundamental instruction-following. More critically, nearly all models degrade substantially under temporal constraints, exposing persistent weaknesses in time awareness and full-duplex interaction. The Game-Time Benchmark provides a foundation for guiding future research toward more temporally aware conversational AI. Demos and datasets are available on our project website https://ga642381.github.io/Game-Time.
