PHyCLIP: $\ell_1$-Product of Hyperbolic Factors Unifies Hierarchy and Compositionality in Vision-Language Representation Learning
2510.08919v1
cs.CV, cs.LG
2025-10-14
Authors:
Daiki Yoshikawa, Takashi Matsubara
Abstract
Vision-language models have achieved remarkable success in multi-modal
representation learning from large-scale pairs of visual scenes and linguistic
descriptions. However, they still struggle to simultaneously express two
distinct types of semantic structures: the hierarchy within a concept family
(e.g., dog $\preceq$ mammal $\preceq$ animal) and the compositionality across
different concept families (e.g., "a dog in a car" $\preceq$ dog, car). Recent
works have addressed this challenge by employing hyperbolic space, which
efficiently captures tree-like hierarchy, yet its suitability for representing
compositionality remains unclear. To resolve this dilemma, we propose PHyCLIP,
which employs an $\ell_1$-Product metric on a Cartesian product of Hyperbolic
factors. With our design, intra-family hierarchies emerge within individual
hyperbolic factors, and cross-family composition is captured by the
$\ell_1$-product metric, analogous to a Boolean algebra. Experiments on
zero-shot classification, retrieval, hierarchical classification, and
compositional understanding tasks demonstrate that PHyCLIP outperforms existing
single-space approaches and offers more interpretable structures in the
embedding space.
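
To make the $\ell_1$-product metric concrete, here is a minimal sketch of how a distance on a Cartesian product of hyperbolic factors could be computed. It assumes each factor is a Poincaré ball of curvature -1; the abstract does not state which hyperbolic model, curvature, or number of factors PHyCLIP uses, and the function names `poincare_distance` and `l1_product_distance` are hypothetical. It illustrates the general idea only and is not the authors' implementation.

```python
import numpy as np


def poincare_distance(x, y, eps=1e-9):
    """Geodesic distance in the Poincare ball (curvature -1 assumed).

    x, y: arrays of shape (..., d) with Euclidean norm < 1.
    """
    sq_diff = np.sum((x - y) ** 2, axis=-1)
    denom = (1.0 - np.sum(x ** 2, axis=-1)) * (1.0 - np.sum(y ** 2, axis=-1))
    # eps guards against division by zero for points near the boundary.
    return np.arccosh(1.0 + 2.0 * sq_diff / np.maximum(denom, eps))


def l1_product_distance(x, y):
    """l1-product distance: the sum of the per-factor hyperbolic distances.

    x, y: arrays of shape (k, d), i.e. k hyperbolic factors of dimension d.
    """
    return float(np.sum(poincare_distance(x, y)))


# Two embeddings with k = 2 factors of dimension d = 2 (illustrative values).
a = np.array([[0.5, 0.0], [0.0, 0.0]])
b = np.array([[0.5, 0.0], [0.0, 0.5]])
print(l1_product_distance(a, b))
```

Because the per-factor distances are summed, two embeddings that differ in only one factor are separated by exactly that factor's hyperbolic distance. This additivity is the property the abstract likens to a Boolean algebra over concept families.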