MammoClean: Toward Reproducible and Bias-Aware AI in Mammography through Dataset Harmonization

2511.02400v1 eess.IV, cs.AI, cs.CV, cs.LG 2025-11-06

Authors:

Yalda Zafari, Hongyi Pan, Gorkem Durak, Ulas Bagci, Essam A. Rashed, Mohamed Mabrok

Abstract

The development of clinically reliable artificial intelligence (AI) systems for mammography is hindered by profound heterogeneity in data quality, metadata standards, and population distributions across public datasets. This heterogeneity introduces dataset-specific biases that severely compromise model generalizability, a fundamental barrier to clinical deployment. We present MammoClean, a public framework for standardization and bias quantification in mammography datasets. MammoClean standardizes case selection and image processing (including laterality and intensity correction) and unifies metadata into a consistent multi-view structure. We provide a comprehensive review of breast anatomy, imaging characteristics, and public mammography datasets to systematically identify key sources of bias. Applying MammoClean to three heterogeneous datasets (CBIS-DDSM, TOMPEI-CMMD, VinDr-Mammo), we quantify substantial distributional shifts in breast density and abnormality prevalence. Critically, we demonstrate the direct impact of data corruption: AI models trained on corrupted datasets exhibit significant performance degradation compared with models trained on the curated versions. By using MammoClean to identify and mitigate sources of bias, researchers can construct unified multi-dataset training corpora that enable the development of robust models with superior cross-domain generalization. MammoClean provides an essential, reproducible pipeline for bias-aware AI development in mammography, facilitating fairer comparisons and advancing the creation of safe, effective systems that perform equitably across diverse patient populations and clinical settings. The open-source code is publicly available at https://github.com/Minds-R-Lab/MammoClean.
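
The abstract mentions laterality and intensity correction without giving details. As a rough illustration only, and not the MammoClean implementation, the sketch below shows one common way such steps are approached: inverting images stored with the DICOM MONOCHROME1 photometric interpretation, and mirroring images so that breast tissue always sits on the same side. The function names (`invert_if_monochrome1`, `tissue_side`, `orient_tissue_left`) and the intensity-sum heuristic are assumptions made for this example.

```python
import numpy as np


def invert_if_monochrome1(image, photometric):
    """Invert intensities when low values are displayed as bright (MONOCHROME1).

    Hypothetical helper: it assumes the image maximum is a reasonable
    stand-in for the full intensity range.
    """
    if photometric != "MONOCHROME1":
        return image
    return image.max() - image


def tissue_side(image):
    """Guess which image half contains more tissue by summed intensity.

    Assumes background is dark (apply intensity correction first).
    """
    mid = image.shape[1] // 2
    left_mass = image[:, :mid].sum()
    right_mass = image[:, mid:].sum()
    return "left" if left_mass >= right_mass else "right"


def orient_tissue_left(image):
    """Mirror the image horizontally if tissue lies in the right half."""
    if tissue_side(image) == "right":
        return np.fliplr(image)
    return image


# Toy example: a 4x6 image with bright "tissue" in the two rightmost columns.
img = np.zeros((4, 6), dtype=np.uint16)
img[:, 4:] = 100
flipped = orient_tissue_left(img)
```

A heuristic like this can disagree with the laterality recorded in the metadata, which is one way mismatches between image content and labels could be flagged for review.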
