Federated Learning for Surgical Vision in Appendicitis Classification: Results of the FedSurg EndoVis 2024 Challenge
2510.04772v1
cs.CV, cs.LG
2025-10-08
Авторы:
Max Kirchner, Hanna Hoffmann, Alexander C. Jenke, Oliver L. Saldanha, Kevin Pfeiffer, Weam Kanjo, Julia Alekseenko, Claas de Boer, Santhi Raj Kolamuri, Lorenzo Mazza, Nicolas Padoy, Sophia Bano, Annika Reinke, Lena Maier-Hein, Danail Stoyanov, Jakob N. Kather, Fiona R. Kolbinger, Sebastian Bodenstedt, Stefanie Speidel
Abstract
Purpose: The FedSurg challenge was designed to benchmark the state of the art
in federated learning for surgical video classification. Its goal was to assess
how well current methods generalize to unseen clinical centers and adapt
through local fine-tuning while enabling collaborative model development
without sharing patient data. Methods: Participants developed strategies to
classify inflammation stages in appendicitis using a preliminary version of the
multi-center Appendix300 video dataset. The challenge evaluated two tasks:
generalization to an unseen center and center-specific adaptation after
fine-tuning. Submitted approaches included foundation models with linear
probing, metric learning with triplet loss, and various FL aggregation schemes
(FedAvg, FedMedian, FedSAM). Performance was assessed using F1-score and
Expected Cost, with ranking robustness evaluated via bootstrapping and
statistical testing. Results: In the generalization task, performance across
centers was limited. In the adaptation task, all teams improved after
fine-tuning, though ranking stability was low. The ViViT-based submission
achieved the strongest overall performance. The challenge highlighted
limitations in generalization, sensitivity to class imbalance, and difficulties
in hyperparameter tuning in decentralized training, while spatiotemporal
modeling and context-aware preprocessing emerged as promising strategies.
Conclusion: The FedSurg Challenge establishes the first benchmark for
evaluating FL strategies in surgical video classification. Findings highlight
the trade-off between local personalization and global robustness, and
underscore the importance of architecture choice, preprocessing, and loss
design. This benchmarking offers a reference point for future development of
imbalance-aware, adaptive, and robust FL methods in clinical surgical AI.
Ссылки и действия
Дополнительные ресурсы: