EvalTalker: Learning to Evaluate Real-Portrait-Driven Multi-Subject Talking Humans

2512.01340v1 cs.CV 2025-12-04

Авторы:

Yingjie Zhou, Xilei Zhu, Siyu Ren, Ziyi Zhao, Ziwen Wang, Farong Wen, Yu Zhou, Jiezhang Cao, Xiongkuo Min, Fengjiao Chen, Xiaoyu Li, Xuezhi Cao, Guangtao Zhai, Xiaohong Liu

Abstract

Speech-driven Talking Human (TH) generation, commonly known as "Talker," currently faces limitations in multi-subject driving capabilities. Extending this paradigm to "Multi-Talker," capable of animating multiple subjects simultaneously, introduces richer interactivity and stronger immersion in audiovisual communication. However, current Multi-Talkers still exhibit noticeable quality degradation caused by technical limitations, resulting in suboptimal user experiences. To address this challenge, we construct THQA-MT, the first large-scale Multi-Talker-generated Talking Human Quality Assessment dataset, consisting of 5,492 Multi-Talker-generated THs (MTHs) from 15 representative Multi-Talkers using 400 real portraits collected online. Through subjective experiments, we analyze perceptual discrepancies among different Multi-Talkers and identify 12 common types of distortion. Furthermore, we introduce EvalTalker, a novel TH quality assessment framework. This framework possesses the ability to perceive global quality, human characteristics, and identity consistency, while integrating Qwen-Sync to perceive multimodal synchrony. Experimental results demonstrate that EvalTalker achieves superior correlation with subjective scores, providing a robust foundation for future research on high-quality Multi-Talker generation and evaluation.

Ссылки и действия

Читать на arXiv Скачать PDF

Дополнительные ресурсы:

EvalTalker: Learning to Evaluate Real-Portrait-Driven Multi-Subject Talking Humans

Авторы:

Abstract

Ссылки и действия

Связанные статьи

ViRectify: A Challenging Benchmark for Video Reasoning Correction with Multimoda...

PaCo-RL: Advancing Reinforcement Learning for Consistent Image Generation with P...

ViDiC: Video Difference Captioning

Beyond the Ground Truth: Enhanced Supervision for Image Restoration

TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task ...

Навигация