Gesture Generation (Still) Needs Improved Human Evaluation Practices: Insights from a Community-Driven State-of-the-Art Benchmark
2511.01233v1
cs.CV, cs.GR, cs.HC, I.3; I.2
2025-11-07
Авторы:
Rajmund Nagy, Hendric Voss, Thanh Hoang-Minh, Mihail Tsakov, Teodor Nikolov, Zeyi Zhang, Tenglong Ao, Sicheng Yang, Shaoli Huang, Yongkang Cheng, M. Hamza Mughal, Rishabh Dabral, Kiran Chhatre, Christian Theobalt, Libin Liu, Stefan Kopp, Rachel McDonnell, Michael Neff, Taras Kucherenko, Youngwoo Yoon, Gustav Eje Henter
Abstract
We review human evaluation practices in automated, speech-driven 3D gesture
generation and find a lack of standardisation and frequent use of flawed
experimental setups. This leads to a situation where it is impossible to know
how different methods compare, or what the state of the art is. In order to
address common shortcomings of evaluation design, and to standardise future
user studies in gesture-generation works, we introduce a detailed human
evaluation protocol for the widely-used BEAT2 motion-capture dataset. Using
this protocol, we conduct large-scale crowdsourced evaluation to rank six
recent gesture-generation models -- each trained by its original authors --
across two key evaluation dimensions: motion realism and speech-gesture
alignment. Our results provide strong evidence that 1) newer models do not
consistently outperform earlier approaches; 2) published claims of high motion
realism or speech-gesture alignment may not hold up under rigorous evaluation;
and 3) the field must adopt disentangled assessments of motion quality and
multimodal alignment for accurate benchmarking in order to make progress.
Finally, in order to drive standardisation and enable new evaluation research,
we will release five hours of synthetic motion from the benchmarked models;
over 750 rendered video stimuli from the user studies -- enabling new
evaluations without model reimplementation required -- alongside our
open-source rendering script, and the 16,000 pairwise human preference votes
collected for our benchmark.