Unifying Information-Theoretic and Pair-Counting Clustering Similarity
2511.03000v1
stat.ML, cs.IT, cs.LG, math.IT
2025-11-08
Авторы:
Alexander J. Gates
Abstract
Comparing clusterings is central to evaluating unsupervised models, yet the
many existing similarity measures can produce widely divergent, sometimes
contradictory, evaluations. Clustering similarity measures are typically
organized into two principal families, pair-counting and information-theoretic,
reflecting whether they quantify agreement through element pairs or aggregate
information across full cluster contingency tables. Prior work has uncovered
parallels between these families and applied empirical normalization or
chance-correction schemes, but their deeper analytical connection remains only
partially understood. Here, we develop an analytical framework that unifies
these families through two complementary perspectives. First, both families are
expressed as weighted expansions of observed versus expected co-occurrences,
with pair-counting arising as a quadratic, low-order approximation and
information-theoretic measures as higher-order, frequency-weighted extensions.
Second, we generalize pair-counting to $k$-tuple agreement and show that
information-theoretic measures can be viewed as systematically accumulating
higher-order co-assignment structure beyond the pairwise level. We illustrate
the approaches analytically for the Rand index and Mutual Information, and show
how other indices in each family emerge as natural extensions. Together, these
views clarify when and why the two regimes diverge, relating their
sensitivities directly to weighting and approximation order, and provide a
principled basis for selecting, interpreting, and extending clustering
similarity measures across applications.