Midway Network: Learning Representations for Recognition and Motion from Latent Dynamics
2510.05558v1
cs.CV, cs.LG
2025-10-09
Авторы:
Christopher Hoang, Mengye Ren
Abstract
Object recognition and motion understanding are key components of perception
that complement each other. While self-supervised learning methods have shown
promise in their ability to learn from unlabeled data, they have primarily
focused on obtaining rich representations for either recognition or motion
rather than both in tandem. On the other hand, latent dynamics modeling has
been used in decision making to learn latent representations of observations
and their transformations over time for control and planning tasks. In this
work, we present Midway Network, a new self-supervised learning architecture
that is the first to learn strong visual representations for both object
recognition and motion understanding solely from natural videos, by extending
latent dynamics modeling to this domain. Midway Network leverages a midway
top-down path to infer motion latents between video frames, as well as a dense
forward prediction objective and hierarchical structure to tackle the complex,
multi-object scenes of natural videos. We demonstrate that after pretraining on
two large-scale natural video datasets, Midway Network achieves strong
performance on both semantic segmentation and optical flow tasks relative to
prior self-supervised learning methods. We also show that Midway Network's
learned dynamics can capture high-level correspondence via a novel analysis
method based on forward feature perturbation.
Ссылки и действия
Дополнительные ресурсы: