LOC: A General Language-Guided Framework for Open-Set 3D Occupancy Prediction
2510.22141v1
cs.CV, cs.CL, cs.LG, cs.RO, eess.IV
2025-10-29
Авторы:
Yuhang Gao, Xiang Xiang, Sheng Zhong, Guoyou Wang
Abstract
Vision-Language Models (VLMs) have shown significant progress in open-set
challenges. However, the limited availability of 3D datasets hinders their
effective application in 3D scene understanding. We propose LOC, a general
language-guided framework adaptable to various occupancy networks, supporting
both supervised and self-supervised learning paradigms. For self-supervised
tasks, we employ a strategy that fuses multi-frame LiDAR points for
dynamic/static scenes, using Poisson reconstruction to fill voids, and
assigning semantics to voxels via K-Nearest Neighbor (KNN) to obtain
comprehensive voxel representations. To mitigate feature over-homogenization
caused by direct high-dimensional feature distillation, we introduce Densely
Contrastive Learning (DCL). DCL leverages dense voxel semantic information and
predefined textual prompts. This efficiently enhances open-set recognition
without dense pixel-level supervision, and our framework can also leverage
existing ground truth to further improve performance. Our model predicts dense
voxel features embedded in the CLIP feature space, integrating textual and
image pixel information, and classifies based on text and semantic similarity.
Experiments on the nuScenes dataset demonstrate the method's superior
performance, achieving high-precision predictions for known classes and
distinguishing unknown classes without additional training data.