A Multicentric Dataset for Training and Benchmarking Breast Cancer Segmentation in H&E Slides
2510.02037v1
q-bio.QM, cs.CV, eess.IV
2025-10-04
Authors:
Carlijn Lems, Leslie Tessier, John-Melle Bokhorst, Mart van Rijthoven, Witali Aswolinskiy, Matteo Pozzi, Natalie Klubickova, Suzanne Dintzis, Michela Campora, Maschenka Balkenhol, Peter Bult, Joey Spronck, Thomas Detone, Mattia Barbareschi, Enrico Munari, Giuseppe Bogina, Jelle Wesseling, Esther H. Lips, Francesco Ciompi, Frédérique Meeuwsen, Jeroen van der Laak
Abstract
Automated semantic segmentation of whole-slide images (WSIs) stained with
hematoxylin and eosin (H&E) is essential for large-scale artificial
intelligence-based biomarker analysis in breast cancer. However, existing
public datasets for breast cancer segmentation lack the morphological diversity
needed to support model generalizability and robust biomarker validation across
heterogeneous patient cohorts. We introduce BrEast cancEr hisTopathoLogy
sEgmentation (BEETLE), a dataset for multiclass semantic segmentation of
H&E-stained breast cancer WSIs. It consists of 587 biopsies and resections from
three collaborating clinical centers and two public datasets, digitized using
seven scanners, and covers all molecular subtypes and histological grades.
Using diverse annotation strategies, we collected annotations across four
classes (invasive epithelium, non-invasive epithelium, necrosis, and other),
with particular focus on morphologies underrepresented in existing datasets,
such as ductal carcinoma in situ and dispersed lobular tumor cells. The
dataset's diversity and relevance to the rapidly growing field of automated
biomarker quantification in breast cancer ensure its high potential for reuse.
Finally, we provide a well-curated, multicentric external evaluation set to
enable standardized benchmarking of breast cancer segmentation models.