OmniDepth: Bridging Monocular and Stereo Reasoning with Latent Alignment

2508.04611v1 cs.CV, cs.RO 2025-08-09

Authors:

Tongfan Guan, Jiaxin Guo, Chen Wang, Yun-Hui Liu

Summary (translated from Russian)

Monocular and stereo depth estimation are two important approaches to 3D perception, and each has its own limitations. Monocular depth estimation captures rich contextual structure of a scene but is often geometrically imprecise. Stereo depth estimation relies on epipolar geometry, which gives it geometric precision, but it suffers from strong ambiguities on reflective or textureless surfaces. This work proposes OmniDepth, a model that unifies the two approaches in a single framework. Its main innovation is an iterative, bidirectional alignment between monocular contextual features and stereo hypothesis representations, implemented with a new cross-attention mechanism. In the reported experiments, OmniDepth reduces zero-shot generalization error by more than 40% on Middlebury and ETH3D and improves results on specular and transparent surfaces. The model bridges monocular and stereo depth estimation within one network.

Abstract

Monocular and stereo depth estimation offer complementary strengths: monocular methods capture rich contextual priors but lack geometric precision, while stereo approaches leverage epipolar geometry yet struggle with ambiguities such as reflective or textureless surfaces. Despite post-hoc synergies, these paradigms remain largely disjoint in practice. We introduce OmniDepth, a unified framework that bridges both through iterative bidirectional alignment of their latent representations. At its core, a novel cross-attentive alignment mechanism dynamically synchronizes monocular contextual cues with stereo hypothesis representations during stereo reasoning. This mutual alignment resolves stereo ambiguities (e.g., specular surfaces) by injecting monocular structure priors while refining monocular depth with stereo geometry within a single network. Extensive experiments demonstrate state-of-the-art results: OmniDepth reduces zero-shot generalization error by more than 40% on Middlebury and ETH3D, while addressing longstanding failures on transparent and reflective surfaces. By harmonizing multi-view geometry with monocular context, OmniDepth enables robust 3D perception that transcends modality-specific limitations. Code is available at https://github.com/aeolusguan/OmniDepth.
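
To make the idea of iterative bidirectional alignment concrete, the following is a minimal NumPy sketch of generic cross-attention applied in both directions between two sets of latent tokens. It is an illustration under stated assumptions, not the authors' architecture: the function names (cross_attend, bidirectional_align), the token and channel sizes, the number of iterations, and the absence of learned projections are all hypothetical choices made for brevity. The actual implementation is in the repository linked above.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(queries, context):
    """Single-head scaled dot-product cross-attention with a residual update.

    queries: (Nq, D) tokens to be updated.
    context: (Nc, D) tokens that supply keys and values.
    Assumption: no learned Q/K/V projections, to keep the sketch short.
    """
    d = queries.shape[-1]
    weights = softmax(queries @ context.T / np.sqrt(d), axis=-1)  # (Nq, Nc)
    return queries + weights @ context, weights


def bidirectional_align(mono, stereo, num_iters=3):
    """Alternately update stereo tokens from mono tokens and vice versa.

    num_iters is a hypothetical setting, not a value taken from the paper.
    """
    for _ in range(num_iters):
        stereo, _ = cross_attend(stereo, mono)  # mono priors inform stereo
        mono, _ = cross_attend(mono, stereo)    # stereo geometry informs mono
    return mono, stereo


# Toy inputs: 6 tokens with 8 channels per branch (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
mono = rng.standard_normal((6, 8))
stereo = rng.standard_normal((6, 8))
mono_out, stereo_out = bidirectional_align(mono, stereo)
print(mono_out.shape, stereo_out.shape)
```

In this sketch each direction reuses the same attention routine, so the only design decision is the order of the two updates inside the loop; a learned model would typically add projection matrices and normalization around each step.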
