Revealing Latent Information: A Physics-inspired Self-supervised Pre-training Framework for Noisy and Sparse Events

2508.05507v1 cs.CV 2025-08-09

Authors:

Lin Zhu, Ruonan Liu, Xiao Wang, Lizhi Wang, Hua Huang

Summary

An event camera is a novel neuromorphic vision sensor that records data with high temporal resolution and wide dynamic range, which makes accurate visual representations possible in challenging scenarios. Extracting features from event data is difficult, however, because the data is sparse and noisy and mainly reflects brightness changes. To address this, the authors propose a self-supervised pre-training method that aims to reveal latent information in event data, including edge and texture information. The approach has three stages: Difference-guided Masked Modeling, inspired by the physical sampling process of events; Backbone-fixed Feature Transition, which contrasts event and image features without updating the backbone; and Focus-aimed Contrastive Learning, which improves semantic discrimination. Experiments show that the approach is robust and outperforms state-of-the-art methods on downstream tasks such as object recognition, semantic segmentation, and optical flow estimation.

Abstract

The event camera, a novel neuromorphic vision sensor, records data with high temporal resolution and wide dynamic range, offering new possibilities for accurate visual representation in challenging scenarios. However, event data is inherently sparse and noisy, mainly reflecting brightness changes, which complicates effective feature extraction. To address this, we propose a self-supervised pre-training framework to fully reveal latent information in event data, including edge information and texture cues. Our framework consists of three stages. Difference-guided Masked Modeling, inspired by the event physical sampling process, reconstructs temporal intensity difference maps to extract enhanced information from raw event data. Backbone-fixed Feature Transition contrasts event and image features without updating the backbone, to preserve the representations learned from masked modeling and stabilize their effect on contrastive learning. Focus-aimed Contrastive Learning updates the entire model to improve semantic discrimination by focusing on high-value regions. Extensive experiments show our framework is robust and consistently outperforms state-of-the-art methods on various downstream tasks, including object recognition, semantic segmentation, and optical flow estimation. The code and dataset are available at https://github.com/BIT-Vision/EventPretrain.
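
As background for the phrase "event physical sampling process", the sketch below illustrates the commonly used idealized event camera model: a pixel emits an event each time its log intensity moves by a fixed contrast threshold, so summing signed events over a time window approximates the temporal log-intensity difference at that pixel. This is not the authors' implementation. The function names `simulate_events` and `difference_map`, the threshold value, and the use of discrete frames as the intensity signal are assumptions made for illustration only.

```python
import numpy as np

def simulate_events(log_frames, threshold=0.2):
    """Idealized event generation (hypothetical helper, not from the paper).

    A pixel emits one event per `threshold` step that its log intensity
    has moved away from the level recorded at its previous event.
    Returns a list of (t, y, x, polarity) tuples.
    """
    log_frames = np.asarray(log_frames, dtype=np.float64)
    ref = log_frames[0].copy()          # per-pixel level at the last event
    events = []
    for t in range(1, len(log_frames)):
        diff = log_frames[t] - ref
        # Number of whole threshold steps crossed (small epsilon for rounding).
        steps = np.floor(np.abs(diff) / threshold + 1e-9).astype(int)
        polarity = np.sign(diff).astype(int)
        for y, x in zip(*np.nonzero(steps)):
            # Plain Python ints avoid NumPy scalar broadcasting surprises.
            events.extend([(t, int(y), int(x), int(polarity[y, x]))] * int(steps[y, x]))
        ref += polarity * steps * threshold
    return events

def difference_map(events, shape, threshold=0.2):
    """Sum signed events per pixel to estimate the log-intensity change."""
    acc = np.zeros(shape, dtype=np.float64)
    for _, y, x, p in events:
        acc[y, x] += p * threshold
    return acc

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = np.log(rng.uniform(0.1, 1.0, size=(6, 4, 4)))  # toy log-intensity video
    ev = simulate_events(frames)
    est = difference_map(ev, frames[0].shape)
    true = frames[-1] - frames[0]
    print("max abs error:", np.abs(est - true).max())
```

Under this idealized model the estimate differs from the true log-intensity difference by less than one threshold per pixel. Real sensors add noise and per-pixel threshold variation, which is one reason the abstract describes event data as noisy and sparse.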


