Ensembling Pruned Attention Heads For Uncertainty-Aware Efficient Transformers

2510.18358v1 cs.LG, cs.CV 2025-10-23

Авторы:

Firas Gabetni, Giuseppe Curci, Andrea Pilzer, Subhankar Roy, Elisa Ricci, Gianni Franchi

Abstract

Uncertainty quantification (UQ) is essential for deploying deep neural networks in safety-critical settings. Although methods like Deep Ensembles achieve strong UQ performance, their high computational and memory costs hinder scalability to large models. We introduce Hydra Ensembles, an efficient transformer-based ensemble that prunes attention heads to create diverse members and merges them via a new multi-head attention with grouped fully-connected layers. This yields a compact model with inference speed close to a single network, matching or surpassing Deep Ensembles in UQ performance without retraining from scratch. We also provide an in-depth analysis of pruning, showing that naive approaches can harm calibration, whereas Hydra Ensembles preserves robust uncertainty. Experiments on image and text classification tasks, with various architectures, show consistent gains over Deep Ensembles. Remarkably, in zero-shot classification on ImageNet-1k, our approach surpasses state of the art methods, even without requiring additional training.

Ссылки и действия

Читать на arXiv Скачать PDF

Дополнительные ресурсы:

Ensembling Pruned Attention Heads For Uncertainty-Aware Efficient Transformers

Авторы:

Abstract

Ссылки и действия

Связанные статьи

Studying Various Activation Functions and Non-IID Data for Machine Learning Mode...

Feature Engineering vs. Deep Learning for Automated Coin Grading: A Comparative ...

Rethinking Decoupled Knowledge Distillation: A Predictive Distribution Perspecti...

Value Gradient Guidance for Flow Matching Alignment

Efficient Training of Diffusion Mixture-of-Experts Models: A Practical Recipe

Навигация