Why Stop at Words? Unveiling the Bigger Picture through Line-Level OCR

2508.21693v1 cs.CV, cs.AI, cs.CL, cs.LG 2025-09-02

Authors:

Shashank Vempati, Nishit Anand, Gaurav Talebailkar, Arpan Garai, Chetan Arora

Summary

This work addresses the suboptimal accuracy and efficiency of existing OCR pipelines, which the authors attribute to errors in word segmentation. They propose moving from word-level to line-level OCR, using sequence-to-sequence models to recognize whole lines at once. This approach bypasses word-detection errors and gives language models a larger sentence context to work with. The authors contribute their own dataset of 251 English page images with line-level annotations for training and benchmarking, and report a 5.4% improvement in end-to-end accuracy and a 4 times improvement in efficiency compared to word-based pipelines. The results point to the promise of this approach for document images, and to further gains as large language models continue to improve.

Abstract

Conventional optical character recognition (OCR) techniques segmented each character and then recognized it. This made them prone to errors in character segmentation and left them without the context needed to exploit language models. Advances in sequence-to-sequence translation in the last decade led to modern techniques that first detect words and then feed one word at a time to a model that directly outputs the full word as a sequence of characters. This allowed better utilization of language models and bypassed the error-prone character segmentation step. We observe that this transition has moved the accuracy bottleneck to word segmentation. Hence, in this paper, we propose a natural and logical progression from word-level OCR to line-level OCR. The proposal bypasses errors in word detection and provides larger sentence context for better utilization of language models. We show that the proposed technique improves not only the accuracy but also the efficiency of OCR. Despite a thorough literature survey, we did not find any public dataset for training and benchmarking such a shift from word-level to line-level OCR. Hence, we also contribute a meticulously curated dataset of 251 English page images with line-level annotations. Our experiments show an end-to-end accuracy improvement of 5.4%, underscoring the potential benefits of transitioning to line-level OCR, especially for document images. We also report a 4 times improvement in efficiency compared to word-based pipelines. With continuous improvements in large language models, our methodology also holds potential to exploit such advances. Project Website: https://nishitanand.github.io/line-level-ocr-website
