To Augment or Not to Augment? Diagnosing Distributional Symmetry Breaking
2510.01349v1
cs.LG, stat.ML
2025-10-04
Авторы:
Hannah Lawrence, Elyssa Hofgard, Vasco Portilheiro, Yuxuan Chen, Tess Smidt, Robin Walters
Abstract
Symmetry-aware methods for machine learning, such as data augmentation and
equivariant architectures, encourage correct model behavior on all
transformations (e.g. rotations or permutations) of the original dataset. These
methods can improve generalization and sample efficiency, under the assumption
that the transformed datapoints are highly probable, or "important", under the
test distribution. In this work, we develop a method for critically evaluating
this assumption. In particular, we propose a metric to quantify the amount of
anisotropy, or symmetry-breaking, in a dataset, via a two-sample neural
classifier test that distinguishes between the original dataset and its
randomly augmented equivalent. We validate our metric on synthetic datasets,
and then use it to uncover surprisingly high degrees of alignment in several
benchmark point cloud datasets. We show theoretically that distributional
symmetry-breaking can actually prevent invariant methods from performing
optimally even when the underlying labels are truly invariant, as we show for
invariant ridge regression in the infinite feature limit. Empirically, we find
that the implication for symmetry-aware methods is dataset-dependent:
equivariant methods still impart benefits on some anisotropic datasets, but not
others. Overall, these findings suggest that understanding equivariance -- both
when it works, and why -- may require rethinking symmetry biases in the data.
Ссылки и действия
Дополнительные ресурсы: