TCSAFormer: Efficient Vision Transformer with Token Compression and Sparse Attention for Medical Image Segmentation

2508.04058v1 cs.CV 2025-08-09

Authors:

Zunhui Xia, Hongxing Li, Libin Lan

Summary

Medical image segmentation is widely used across many fields, but two main problems remain: high computational complexity, especially for long input sequences, and insufficient accuracy in capturing local context and multiscale features. To address these problems, we propose TCSAFormer, an efficient transformer-based network. TCSAFormer rests on two ideas: a Compressed Attention (CA) module, which combines token compression with pixel-level sparse attention to focus on the most relevant key-value pairs, and a Dual-Branch Feed-Forward Network (DBFFN), which strengthens the model's ability to capture multiscale features. We evaluated TCSAFormer on three public medical datasets, and the results show that the network outperforms existing methods in accuracy while keeping computational overhead lower.

Abstract

In recent years, transformer-based methods have achieved remarkable progress in medical image segmentation due to their superior ability to capture long-range dependencies. However, these methods typically suffer from two major limitations. First, their computational complexity scales quadratically with the input sequence length. Second, the feed-forward network (FFN) modules in vanilla Transformers typically rely on fully connected layers, which limits the models' ability to capture local contextual information and multiscale features critical for precise semantic segmentation. To address these issues, we propose an efficient medical image segmentation network, named TCSAFormer. The proposed TCSAFormer adopts two key ideas. First, it incorporates a Compressed Attention (CA) module, which combines token compression and pixel-level sparse attention to dynamically focus on the most relevant key-value pairs for each query. This is achieved by pruning globally irrelevant tokens and merging redundant ones, significantly reducing computational complexity while enhancing the model's ability to capture relationships between tokens. Second, it introduces a Dual-Branch Feed-Forward Network (DBFFN) module as a replacement for the standard FFN to capture local contextual features and multiscale information, thereby strengthening the model's feature representation capability. We conduct extensive experiments on three publicly available medical image segmentation datasets (ISIC-2018, CVC-ClinicDB, and Synapse) to evaluate the segmentation performance of TCSAFormer. Experimental results demonstrate that TCSAFormer achieves superior performance compared to existing state-of-the-art (SOTA) methods while maintaining lower computational overhead, thus achieving an optimal trade-off between efficiency and accuracy.
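The idea of attending only to "the most relevant key-value pairs for each query" can be illustrated with a generic top-k sparse attention sketch. This is not the authors' CA module: the abstract does not specify how tokens are pruned, merged, or selected, so the function name `topk_sparse_attention`, the parameter `k`, and the dot-product scoring rule are all assumptions made for illustration. The sketch also still scores every query against every key, so it shows only the selection step, not the cost reduction the paper reports.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def topk_sparse_attention(queries, keys, values, k):
    """Hypothetical helper: each query attends only to its k best-scoring keys.

    queries: list of vectors of dimension d
    keys:    list of vectors of dimension d
    values:  list of vectors (any dimension), one per key
    k:       number of key-value pairs kept per query (assumed parameter)
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    value_dim = len(values[0])
    outputs = []
    for q in queries:
        # Scaled dot-product score of this query against every key.
        scores = [dot(q, key) * scale for key in keys]
        # Keep the indices of the k highest scores; the rest are ignored.
        keep = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
        # Softmax is computed over the kept scores only.
        weights = softmax([scores[i] for i in keep])
        # Weighted sum of the kept value vectors.
        out = [
            sum(w * values[i][d] for w, i in zip(weights, keep))
            for d in range(value_dim)
        ]
        outputs.append(out)
    return outputs


if __name__ == "__main__":
    queries = [[1.0, 0.0]]
    keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(topk_sparse_attention(queries, keys, values, k=1))
```

With `k` equal to the number of keys, the function reduces to ordinary dense softmax attention; smaller `k` restricts each query to fewer key-value pairs.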


