📊 Статистика дайджестов

Всего дайджестов: 34022 Добавлено сегодня: 0

Последнее обновление: сегодня

📄 A U-Net and Transformer Pipeline for Multilingual Image Translation

2025-10-29

Авторы:

Siddharth Sahay, Radhika Agarwal

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

This paper presents an end-to-end multilingual translation pipeline that integrates a custom U-Net for text detection, the Tesseract engine for text recognition, and a from-scratch sequence-to-sequence (Seq2Seq) Transformer for Neural Machine Translation (NMT). Our approach first utilizes a U-Net model, trained on a synthetic dataset , to accurately segment and detect text regions from an image. These detected regions are then processed by Tesseract to extract the source text. This extracted tex...

ID: 2510.23554v1 cs.LG, cs.CL, cs.CV

arXiv PDF

📄 DynaSolidGeo: A Dynamic Benchmark for Genuine Spatial Mathematical Reasoning of VLMs in Solid Geometry

2025-10-29

Авторы:

Changti Wu, Shijie Lian, Zihao Liu, Lei Zhang, Laurence Tianruo Yang, Kai Chen

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Solid geometry problem solving demands spatial mathematical reasoning that integrates spatial intelligence and symbolic reasoning. However, most existing multimodal mathematical reasoning benchmarks focus primarily on 2D plane geometry, rely on static datasets prone to data contamination and memorization, and evaluate models solely by final answers, overlooking the reasoning process. To address these limitations, we introduce DynaSolidGeo, the first dynamic benchmark for evaluating genuine spati...

ID: 2510.22340v1 cs.AI, cs.CL, cs.CV, cs.LG

arXiv PDF

📄 JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence

2025-10-29

Авторы:

Qiushi Sun, Jingyang Gong, Yang Liu, Qiaosheng Chen, Lei Li, Kai Chen, Qipeng Guo, Ben Kao, Fei Yuan

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

The scope of neural code intelligence is rapidly expanding beyond text-based source code to encompass the rich visual outputs that programs generate. This visual dimension is critical for advanced applications like flexible content generation and precise, program-driven editing of visualizations. However, progress has been impeded by the scarcity of high-quality multimodal code data, a bottleneck stemming from challenges in synthesis and quality assessment. To address these challenges, we make c...

ID: 2510.23538v1 cs.AI, cs.CL, cs.CV, cs.SE

arXiv PDF

📄 FairImagen: Post-Processing for Bias Mitigation in Text-to-Image Models

2025-10-28

Авторы:

Zihao Fu, Ryan Brown, Shun Shao, Kai Rawal, Eoin Delaney, Chris Russell

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Text-to-image diffusion models, such as Stable Diffusion, have demonstrated remarkable capabilities in generating high-quality and diverse images from natural language prompts. However, recent studies reveal that these models often replicate and amplify societal biases, particularly along demographic attributes like gender and race. In this paper, we introduce FairImagen (https://github.com/fuzihaofzh/FairImagen), a post-hoc debiasing framework that operates on prompt embeddings to mitigate such...

ID: 2510.21363v1 cs.LG, cs.CL, cs.CV

arXiv PDF

📄 Multimedia-Aware Question Answering: A Review of Retrieval and Cross-Modal Reasoning Architectures

2025-10-25

Авторы:

Rahul Raja, Arpita Vats

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Question Answering (QA) systems have traditionally relied on structured text data, but the rapid growth of multimedia content (images, audio, video, and structured metadata) has introduced new challenges and opportunities for retrieval-augmented QA. In this survey, we review recent advancements in QA systems that integrate multimedia retrieval pipelines, focusing on architectures that align vision, language, and audio modalities with user queries. We categorize approaches based on retrieval meth...

ID: 2510.20193v1 cs.IR, cs.CL, cs.CV, cs.LG

arXiv PDF

📄 Compress to Impress: Efficient LLM Adaptation Using a Single Gradient Step on 100 Samples

2025-10-25

Авторы:

Shiva Sreeram, Alaa Maalouf, Pratyusha Sharma, Daniela Rus

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Recently, Sharma et al. suggested a method called Layer-SElective-Rank reduction (LASER) which demonstrated that pruning high-order components of carefully chosen LLM's weight matrices can boost downstream accuracy -- without any gradient-based fine-tuning. Yet LASER's exhaustive, per-matrix search (each requiring full-dataset forward passes) makes it impractical for rapid deployment. We demonstrate that this overhead can be removed and find that: (i) Only a small, carefully chosen subset of mat...

ID: 2510.20800v1 cs.LG, cs.AI, cs.CL, cs.CV

arXiv PDF

📄 Real Deep Research for AI, Robotics and Beyond

2025-10-25

Авторы:

Xueyan Zou, Jianglong Ye, Hao Zhang, Xiaoyu Xiang, Mingyu Ding, Zhaojing Yang, Yong Jae Lee, Zhuowen Tu, Sifei Liu, Xiaolong Wang

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

With the rapid growth of research in AI and robotics now producing over 10,000 papers annually it has become increasingly difficult for researchers to stay up to date. Fast evolving trends, the rise of interdisciplinary work, and the need to explore domains beyond one's expertise all contribute to this challenge. To address these issues, we propose a generalizable pipeline capable of systematically analyzing any research area: identifying emerging trends, uncovering cross domain opportunities, a...

ID: 2510.20809v1 cs.AI, cs.CL, cs.CV, cs.LG

arXiv PDF

📄 Spatio-temporal Sign Language Representation and Translation

2025-10-24

Авторы:

Yasser Hamidullah, Josef van Genabith, Cristina España-Bonet

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

This paper describes the DFKI-MLT submission to the WMT-SLT 2022 sign language translation (SLT) task from Swiss German Sign Language (video) into German (text). State-of-the-art techniques for SLT use a generic seq2seq architecture with customized input embeddings. Instead of word embeddings as used in textual machine translation, SLT systems use features extracted from video frames. Standard approaches often do not benefit from temporal features. In our participation, we present a system that ...

ID: 2510.19413v1 cs.CL, cs.CV

arXiv PDF

📄 Metrics and evaluations for computational and sustainable AI efficiency

2025-10-23

Авторы:

Hongyuan Liu, Xinyang Liu, Guosheng Hu

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

The rapid advancement of Artificial Intelligence (AI) has created unprecedented demands for computational power, yet methods for evaluating the performance, efficiency, and environmental impact of deployed models remain fragmented. Current approaches often fail to provide a holistic view, making it difficult to compare and optimise systems across heterogeneous hardware, software stacks, and numeric precisions. To address this gap, we propose a unified and reproducible methodology for AI model in...

ID: 2510.17885v1 cs.PF, cs.AI, cs.CL, cs.CV

arXiv PDF

📄 From Preferences to Prejudice: The Role of Alignment Tuning in Shaping Social Bias in Video Diffusion Models

2025-10-22

Авторы:

Zefan Cai, Haoyi Qiu, Haozhe Zhao, Ke Wan, Jiachen Li, Jiuxiang Gu, Wen Xiao, Nanyun Peng, Junjie Hu

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Recent advances in video diffusion models have significantly enhanced text-to-video generation, particularly through alignment tuning using reward models trained on human preferences. While these methods improve visual quality, they can unintentionally encode and amplify social biases. To systematically trace how such biases evolve throughout the alignment pipeline, we introduce VideoBiasEval, a comprehensive diagnostic framework for evaluating social representation in video generation. Grounded...

ID: 2510.17247v1 cs.CL, cs.CV

arXiv PDF

Показано 41 - 50 из 162 записей