DCText: Scheduled Attention Masking for Visual Text Generation via Divide-and-Conquer Strategy

2512.01302v1 cs.CV 2025-12-04

Авторы:

Jaewoo Song, Jooyoung Choi, Kanghyun Baek, Sangyub Lee, Daemin Park, Sungroh Yoon

Abstract

Despite recent text-to-image models achieving highfidelity text rendering, they still struggle with long or multiple texts due to diluted global attention. We propose DCText, a training-free visual text generation method that adopts a divide-and-conquer strategy, leveraging the reliable short-text generation of Multi-Modal Diffusion Transformers. Our method first decomposes a prompt by extracting and dividing the target text, then assigns each to a designated region. To accurately render each segment within their regions while preserving overall image coherence, we introduce two attention masks - Text-Focus and Context-Expansion - applied sequentially during denoising. Additionally, Localized Noise Initialization further improves text accuracy and region alignment without increasing computational cost. Extensive experiments on single- and multisentence benchmarks show that DCText achieves the best text accuracy without compromising image quality while also delivering the lowest generation latency.

Ссылки и действия

Читать на arXiv Скачать PDF

Дополнительные ресурсы:

DCText: Scheduled Attention Masking for Visual Text Generation via Divide-and-Conquer Strategy

Авторы:

Abstract

Ссылки и действия

Связанные статьи

ViRectify: A Challenging Benchmark for Video Reasoning Correction with Multimoda...

PaCo-RL: Advancing Reinforcement Learning for Consistent Image Generation with P...

ViDiC: Video Difference Captioning

Beyond the Ground Truth: Enhanced Supervision for Image Restoration

TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task ...

Навигация