Bias-Corrected Data Synthesis for Imbalanced Learning
2510.26046v1
stat.ML, cs.LG, stat.ME
2025-11-01
Authors:
Pengfei Lyu, Zhengchi Ma, Linjun Zhang, Anru R. Zhang
Abstract
Imbalanced data, in which positive samples make up only a small fraction of the
observations relative to negative samples, makes it difficult for classifiers
to balance the false positive and false negative rates. A common
approach to addressing the challenge involves generating synthetic data for the
minority group and then training classification models with both observed and
synthetic data. However, since the synthetic data depends on the observed data
and fails to replicate the original data distribution accurately, prediction
accuracy is reduced when the synthetic data is naively treated as the true
data. In this paper, we address the bias introduced by synthetic data and
provide consistent estimators for this bias by borrowing information from the
majority group. We propose a bias correction procedure to mitigate the adverse
effects of synthetic data, enhancing prediction accuracy while avoiding
overfitting. This procedure is extended to broader scenarios with imbalanced
data, such as imbalanced multi-task learning and causal inference. Theoretical
properties, including bounds on bias estimation errors and improvements in
prediction accuracy, are provided. Simulation results and data analysis on
handwritten digit datasets demonstrate the effectiveness of our method.
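
To make the "common approach" mentioned in the abstract concrete, the following is a minimal sketch of naive minority-class synthesis by interpolation (in the spirit of SMOTE), followed by pooling observed and synthetic data. It is not the bias-corrected procedure proposed in the paper. The function name `interpolate_minority`, the parameter `k`, and the toy Gaussian data are assumptions made for illustration only.

```python
import numpy as np


def interpolate_minority(X_min, n_new, k=5, seed=0):
    """Hypothetical SMOTE-style generator (illustrative assumption).

    Each synthetic point is a random convex combination of an observed
    minority point and one of its k nearest minority neighbours.
    Requires at least two minority points.
    """
    rng = np.random.default_rng(seed)
    n = X_min.shape[0]
    k = min(k, n - 1)
    # Pairwise squared Euclidean distances within the minority class.
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbour
    nbrs = np.argsort(d2, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)            # anchor points
    pick = nbrs[base, rng.integers(0, k, size=n_new)]  # chosen neighbours
    lam = rng.random((n_new, 1))                     # interpolation weights in [0, 1)
    return X_min[base] + lam * (X_min[pick] - X_min[base])


# Toy imbalanced data (assumed): 500 negatives, 20 positives.
rng = np.random.default_rng(1)
X_neg = rng.normal(0.0, 1.0, size=(500, 2))
X_pos = rng.normal(2.0, 1.0, size=(20, 2))

# Naive approach: synthesize positives until the classes are balanced,
# then treat the synthetic points as if they were true observations.
X_syn = interpolate_minority(X_pos, n_new=len(X_neg) - len(X_pos))
X_train = np.vstack([X_neg, X_pos, X_syn])
y_train = np.concatenate([np.zeros(len(X_neg)),
                          np.ones(len(X_pos) + len(X_syn))])
```

Because every synthetic point here is a convex combination of observed minority points, the synthetic sample depends on the observed data and cannot leave the region those points span. This is one simple way to see why, as the abstract notes, such data need not replicate the true minority distribution when treated naively as real data.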