AttentionViG: Cross-Attention-Based Dynamic Neighbor Aggregation in Vision GNNs
2509.25570v1
cs.CV, cs.AI, eess.IV
2025-10-02
Authors:
Hakan Emre Gedik, Andrew Martin, Mustafa Munir, Oguzhan Baser, Radu Marculescu, Sandeep P. Chinchali, Alan C. Bovik
Abstract
Vision Graph Neural Networks (ViGs) have demonstrated promising performance
in image recognition tasks against Convolutional Neural Networks (CNNs) and
Vision Transformers (ViTs). An essential part of the ViG framework is the
node-neighbor feature aggregation method. Although various graph convolution
methods, such as Max-Relative, EdgeConv, GIN, and GraphSAGE, have been
explored, a versatile aggregation method that effectively captures complex
node-neighbor relationships without requiring architecture-specific refinements
is needed. To address this gap, we propose a cross-attention-based aggregation
method in which the query projections come from the node, while the key
projections come from its neighbors. Additionally, we introduce a novel
architecture called AttentionViG that uses the proposed cross-attention
aggregation scheme to conduct non-local message passing. We evaluated the image
recognition performance of AttentionViG on the ImageNet-1K benchmark, where it
achieved state-of-the-art (SOTA) performance. Additionally, we assessed its transferability to
downstream tasks, including object detection and instance segmentation on MS
COCO 2017, as well as semantic segmentation on ADE20K. Our results demonstrate
that the proposed method not only achieves strong performance, but also
maintains efficiency, delivering competitive accuracy with comparable FLOPs to
prior vision GNN architectures.
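
The aggregation idea described in the abstract (queries projected from the node, keys projected from its neighbors) can be illustrated with a minimal sketch. The abstract does not specify the scoring function, the normalization, or what plays the role of the values, so everything beyond the query/key split is an assumption made here for illustration: scaled dot-product scores, a softmax over each node's neighbors, and raw neighbor features used as values. The names `cross_attention_aggregate`, `w_q`, and `w_k` are hypothetical and are not taken from the paper.

```python
import math


def matvec(x, w):
    """Multiply a row vector x (length D) by a matrix w (D x H)."""
    return [sum(x[d] * w[d][h] for d in range(len(x))) for h in range(len(w[0]))]


def softmax(scores):
    """Numerically stable softmax over a non-empty list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_aggregate(features, neighbors, w_q, w_k):
    """Aggregate neighbor features with attention weights per node.

    features:  list of N node feature vectors, each of length D
    neighbors: neighbors[i] lists the neighbor indices of node i (non-empty)
    w_q, w_k:  D x H projection matrices (hypothetical parameters)
    """
    aggregated = []
    for i, nbrs in enumerate(neighbors):
        # Query comes from the node itself.
        q = matvec(features[i], w_q)
        # Keys come from the node's neighbors.
        keys = [matvec(features[j], w_k) for j in nbrs]
        # Assumption: scaled dot-product scores followed by a softmax.
        scale = math.sqrt(len(q))
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Assumption: raw neighbor features act as the values.
        dim = len(features[i])
        out = [
            sum(w * features[j][d] for w, j in zip(weights, nbrs))
            for d in range(dim)
        ]
        aggregated.append(out)
    return aggregated


# Toy graph: three nodes with 2-D features, each connected to the other two.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = [[1, 2], [0, 2], [0, 1]]
identity = [[1.0, 0.0], [0.0, 1.0]]
result = cross_attention_aggregate(features, neighbors, identity, identity)
```

In this toy setup, each output row is a convex combination of the corresponding node's neighbor features, with neighbors whose keys align more closely with the node's query receiving larger weights. A real model would learn the projections and would typically add a value projection and a feature update step, none of which the abstract describes.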