📊 Digest statistics

Total digests: 34022 Added today: 0

Last updated: today

📄 MAC-SLU: Multi-Intent Automotive Cabin Spoken Language Understanding Benchmark

2025-12-03

Authors:

Yuezhang Peng, Chonghao Cai, Ziang Liu, Shuai Fan, Sheng Jiang, Hua Xu, Yuxin Liu, Qiguang Chen, Kele Xu, Yao Li, Sheng Wang, Libo Qin, Xie Chen


Annotation:

Spoken Language Understanding (SLU), which aims to extract user semantics to execute downstream tasks, is a crucial component of task-oriented dialog systems. Existing SLU datasets generally lack sufficient diversity and complexity, and there is an absence of a unified benchmark for the latest Large Language Models (LLMs) and Large Audio Language Models (LALMs). This work introduces MAC-SLU, a novel Multi-Intent Automotive Cabin Spoken Language Understanding Dataset, which increases the difficul...

ID: 2512.01603v1 cs.CL, cs.MM

arXiv PDF

📄 ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering

2025-12-02

Authors:

Alberto Compagnoni, Marco Morini, Sara Sarto, Federico Cocchi, Davide Caffagni, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara


Annotation:

Multimodal Large Language Models (MLLMs) have shown impressive capabilities in jointly understanding text, images, and videos, often evaluated via Visual Question Answering (VQA). However, even state-of-the-art MLLMs struggle with domain-specific or knowledge-intensive queries, where relevant information is underrepresented in pre-training data. Knowledge-based VQA (KB-VQA) addresses this by retrieving external documents to condition answer generation, but current retrieval-augmented approaches ...

ID: 2511.22715v1 cs.CV, cs.AI, cs.CL, cs.MM

arXiv PDF

📄 MusicAIR: A Multimodal AI Music Generation Framework Powered by an Algorithm-Driven Core

2025-11-25

Authors:

Callie C. Liao, Duoduo Liao, Ellie L. Zhang


Annotation:

Recent advances in generative AI have made music generation a prominent research focus. However, many neural-based models rely on large datasets, raising concerns about copyright infringement and high-performance costs. In contrast, we propose MusicAIR, an innovative multimodal AI music generation framework powered by a novel algorithm-driven symbolic music core, effectively mitigating copyright infringement risks. The music core algorithms connect critical lyrical and rhythmic information to au...

ID: 2511.17323v1 cs.SD, cs.AI, cs.CL, cs.MM

arXiv PDF

📄 HI-TransPA: Hearing Impairments Translation Personal Assistant

2025-11-18

Authors:

Zhiming Ma, Shiyu Gan, Junhao Zhao, Xianming Li, Qingyun Pan, Peidong Wang, Mingjun Pan, Yuhao Mo, Jiajie Cheng, Chengxin Chen, Zhonglun Cao, Chonghan Liu, Shi Cheng


Annotation:

Hearing-impaired individuals often face significant barriers in daily communication due to the inherent challenges of producing clear speech. To address this, we introduce the Omni-Model paradigm into assistive technology and present HI-TransPA, an instruction-driven audio-visual personal assistant. The model fuses indistinct speech with lip dynamics, enabling both translation and dialogue within a single multimodal framework. To address the distinctive pronunciation patterns of hearing-impaired...

ID: 2511.09915v2 cs.CL, cs.MM, cs.SD

arXiv PDF

📄 HI-TransPA: Hearing Impairments Translation Personal Assistant

2025-11-15

Authors:

Zhiming Ma, Shiyu Gan, Junhao Zhao, Xianming Li, Qingyun Pan, Peidong Wang, Mingjun Pan, Yuhao Mo, Jiajie Cheng, Chengxin Chen, Zhonglun Cao, Chonghan Liu, Shi Cheng


Annotation:

To provide a unified and flexible solution for daily communication among hearing-impaired individuals, we introduce the Omni-Model paradigm into assistive technology and present HI-TransPA, an instruction-driven audio-visual personal assistant. The model fuses indistinct speech with high-frame-rate lip dynamics, enabling both translation and dialogue within a single multimodal framework. To tackle the challenges of noisy and heterogeneous raw data and the limited adaptability of existing Omni-Mo...

ID: 2511.09915v1 cs.CL, cs.MM, cs.SD

arXiv PDF

📄 MIDI-LLM: Adapting Large Language Models for Text-to-MIDI Music Generation

2025-11-08

Authors:

Shih-Lun Wu, Yoon Kim, Cheng-Zhi Anna Huang


Annotation:

We present MIDI-LLM, an LLM for generating multitrack MIDI music from free-form text prompts. Our approach expands a text LLM's vocabulary to include MIDI tokens, and uses a two-stage training recipe to endow text-to-MIDI abilities. By preserving the original LLM's parameter structure, we can directly leverage the vLLM library for accelerated inference. Experiments show that MIDI-LLM achieves higher quality, better text control, and faster inference compared to the recent Text2midi model. Live d...

ID: 2511.03942v1 cs.SD, cs.CL, cs.MM

arXiv PDF

📄 Joint Modeling of Big Five and HEXACO for Multimodal Apparent Personality-trait Recognition

2025-10-18

Authors:

Ryo Masumura, Shota Orihashi, Mana Ihori, Tomohiro Tanaka, Naoki Makishima, Taiga Yamane, Naotaka Kawata, Satoshi Suzuki, Taichi Katayama


Annotation:

This paper proposes a joint modeling method of the Big Five, which has long been studied, and HEXACO, which has recently attracted attention in psychology, for automatically recognizing apparent personality traits from multimodal human behavior. Most previous studies have used the Big Five for multimodal apparent personality-trait recognition. However, no study has focused on apparent HEXACO which can evaluate an Honesty-Humility trait related to displaced aggression and vengefulness, social-dom...

ID: 2510.14203v1 cs.CV, cs.CL, cs.MM

arXiv PDF

📄 MaskCD: Mitigating LVLM Hallucinations by Image Head Masked Contrastive Decoding

2025-10-07

Authors:

Jingyuan Deng, Yujiu Yang


Annotation:

Large vision-language models (LVLMs) have shown remarkable performance in visual-language understanding for downstream multimodal tasks. While their capabilities are improving, problems emerge simultaneously. Among those problems, the hallucinations have attracted much attention, which stands for the phenomenon where LVLMs generate contradictory content to their input visual and text contents. Many approaches have been proposed to deal with this issue, such as contrastive decoding and attention ...

ID: 2510.02790v1 cs.CV, cs.AI, cs.CL, cs.MM

arXiv PDF

📄 FinCap: Topic-Aligned Captions for Short-Form Financial YouTube Videos

2025-10-02

Authors:

Siddhant Sukhani, Yash Bhardwaj, Riya Bhadani, Veer Kejriwal, Michael Galarnyk, Sudheer Chava


Annotation:

We evaluate multimodal large language models (MLLMs) for topic-aligned captioning in financial short-form videos (SVs) by testing joint reasoning over transcripts (T), audio (A), and video (V). Using 624 annotated YouTube SVs, we assess all seven modality combinations (T, A, V, TA, TV, AV, TAV) across five topics: main recommendation, sentiment analysis, video purpose, visual analysis, and financial entity recognition. Video alone performs strongly on four of five topics, underscoring its value ...

ID: 2509.25745v1 cs.CV, cs.CL, cs.MM

arXiv PDF

📄 RCI: A Score for Evaluating Global and Local Reasoning in Multimodal Benchmarks

2025-10-01

Authors:

Amit Agarwal, Hitesh Laxmichand Patel, Srikant Panda, Hansa Meghwani, Jyotika Singh, Karan Dua, Paul Li, Tao Sheng, Sujith Ravi, Dan Roth

## Context

Multimodal knowledge, which combines visual and textual data, has become an important research area in artificial intelligence. Many Multimodal Large Language Models (MLLMs) have shown outstanding results on vision-language benchmarks. However, it remains unclear whether these benchmarks assess genuine global reasoning or allow success via local visual cues. Existing evaluation methods do not explicitly measure this distinction, which leads to subjective dataset curation and limits the potential of models in real-world scenarios.

## Method

The Region Comprehension Index (RCI) is the first model-based score that measures how much a task depends on global versus local visual information. Using a reference model, it compares performance on full images and on their individual parts to determine whether a task requires global understanding or can be solved from local cues.

## Results

Applying RCI to 13 widely used vision-language benchmarks showed that most of them favor local cues, which leads to a strong dependence on spatial cues. This can have undesirable consequences in real-world scenarios. RCI is therefore an important tool for diagnosing and addressing these problems, making it possible to create more balanced benchmarks and to develop models oriented toward real-world use.

## Significance

RCI can be applied in a wide range of settings, including diagnosing problems in current benchmarks, optimizing model accuracy, and developing benchmarks that encourage real-world-oriented models. It offers a practical approach to building more meaningful and realistic benchmarks that will improve model capabilities in real-world scenarios.

## Conclusions

The results show that RCI is an effective tool for measuring global and local reasoning in vision-language models. It identifies problems in current benchmarks and is intended to help developers build more adequate, globally oriented models. Future research will focus on extending RCI to other types of benchmarks and on using it to develop more versatile multimodal models.

Annotation:

Multimodal Large Language Models (MLLMs) have achieved impressive results on vision-language benchmarks, yet it remains unclear whether these benchmarks assess genuine global reasoning or allow success via localized visual cues. Existing evaluation methods do not explicitly measure this distinction, hindering effective dataset curation and real-world focused model development. We introduce Region Comprehension Index (RCI), the first model-based score to directly quantify a dataset's reliance o...

ID: 2509.23673v1 cs.CV, cs.AI, cs.CL, cs.MM, 68T45, 68T50, I.2.7; I.2.10; I.4.7; I.4.8

arXiv PDF

Showing 1-10 of 28 entries