MammoClean: Toward Reproducible and Bias-Aware AI in Mammography through Dataset Harmonization
2511.02400v1
eess.IV, cs.AI, cs.CV, cs.LG
2025-11-06
Authors:
Yalda Zafari, Hongyi Pan, Gorkem Durak, Ulas Bagci, Essam A. Rashed, Mohamed Mabrok
Abstract
The development of clinically reliable artificial intelligence (AI) systems
for mammography is hindered by profound heterogeneity in data quality, metadata
standards, and population distributions across public datasets. This
heterogeneity introduces dataset-specific biases that severely compromise
model generalizability, a fundamental barrier to clinical deployment. We
present MammoClean, a public framework for standardization and bias
quantification in mammography datasets. MammoClean standardizes case selection
and image processing (including laterality and intensity correction) and unifies
metadata into a consistent multi-view structure. We provide a comprehensive
review of breast anatomy, imaging characteristics, and public mammography
datasets to systematically identify key sources of bias. Applying MammoClean to
three heterogeneous datasets (CBIS-DDSM, TOMPEI-CMMD, VinDr-Mammo), we quantify
substantial distributional shifts in breast density and abnormality prevalence.
Critically, we demonstrate the direct impact of data corruption: AI models
trained on corrupted datasets exhibit significant performance degradation
compared with models trained on curated counterparts. By using MammoClean to identify and
mitigate bias sources, researchers can construct unified multi-dataset training
corpora that enable development of robust models with superior cross-domain
generalization. MammoClean provides an essential, reproducible pipeline for
bias-aware AI development in mammography, facilitating fairer comparisons and
advancing the creation of safe, effective systems that perform equitably across
diverse patient populations and clinical settings. The open-source code is
publicly available from: https://github.com/Minds-R-Lab/MammoClean.
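
The abstract names laterality correction, intensity correction, and the quantification of distributional shift without saying how they are computed. The Python sketch below illustrates one plausible reading of each step. It is not MammoClean's actual code or API: the function names (correct_intensity, infer_laterality, orient_left, density_shift), the half-image intensity heuristic, and the use of total variation distance are all assumptions introduced here for illustration.

```python
import numpy as np


def correct_intensity(image, photometric="MONOCHROME2"):
    """Return a float image in which brighter pixels mean denser tissue.

    Assumption: images stored with the DICOM MONOCHROME1 interpretation are
    inverted, so they are flipped around their own maximum value.
    """
    img = np.asarray(image, dtype=np.float32)
    if photometric == "MONOCHROME1":
        img = img.max() - img
    return img


def infer_laterality(image):
    """Guess which half of the image holds the breast ("L" or "R").

    Hypothetical heuristic: the half with the larger summed intensity is
    taken to contain the breast tissue.
    """
    img = np.asarray(image, dtype=np.float64)
    mid = img.shape[1] // 2
    left_mass = img[:, :mid].sum()
    right_mass = img[:, mid:].sum()
    return "L" if left_mass >= right_mass else "R"


def orient_left(image):
    """Flip horizontally, if needed, so the breast sits in the left half."""
    img = np.asarray(image)
    if infer_laterality(img) == "R":
        return np.fliplr(img)
    return img


def density_shift(counts_a, counts_b):
    """Total variation distance between two category-count vectors.

    Inputs are per-category counts (for example, breast density categories)
    from two datasets; the result lies in [0, 1], with 0 meaning identical
    proportions.
    """
    p = np.asarray(counts_a, dtype=np.float64)
    q = np.asarray(counts_b, dtype=np.float64)
    p = p / p.sum()
    q = q / q.sum()
    return 0.5 * float(np.abs(p - q).sum())
```

A typical use would apply correct_intensity and then orient_left to each image before training, and call density_shift on the density-category counts of two datasets to obtain a single number summarizing how far their distributions differ.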