UrbanFusion: Stochastic Multimodal Fusion for Contrastive Learning of Robust Spatial Representations
2510.13774v1
cs.LG, cs.CV
2025-10-17
Authors:
Dominik J. Mühlematter, Lin Che, Ye Hong, Martin Raubal, Nina Wiedemann
Abstract
Forecasting urban phenomena such as housing prices and public health
indicators requires the effective integration of various geospatial data.
Current methods primarily utilize task-specific models, while recent foundation
models for spatial representations often support only limited modalities and
lack multimodal fusion capabilities. To overcome these challenges, we present
UrbanFusion, a Geo-Foundation Model (GeoFM) that features Stochastic Multimodal
Fusion (SMF). The framework employs modality-specific encoders to process
different types of inputs, including street view imagery, remote sensing data,
cartographic maps, and point-of-interest (POI) data. These multimodal inputs
are integrated via a Transformer-based fusion module that learns unified
representations. An extensive evaluation across 41 tasks in 56 cities worldwide
demonstrates UrbanFusion's strong generalization and predictive performance
compared to state-of-the-art GeoAI models. Specifically, it 1) outperforms
prior foundation models on location encoding, 2) allows multimodal input during
inference, and 3) generalizes well to regions unseen during training.
UrbanFusion can flexibly utilize any subset of available modalities for a given
location during both pretraining and inference, enabling broad applicability
across diverse data availability scenarios. All source code is available at
https://github.com/DominikM198/UrbanFusion.
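
The idea of using any subset of the available modalities can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `sample_modality_subset` and `fuse`, the modality keys, and the toy 2-dimensional embeddings are all hypothetical, and simple averaging stands in for the Transformer-based fusion module described above.

```python
import random


def sample_modality_subset(available, rng):
    """Return a random non-empty subset of the available modalities.

    Assumption: subsets are drawn by first picking a size, then sampling
    that many modalities. The abstract does not specify the sampling scheme.
    """
    available = list(available)
    if not available:
        raise ValueError("at least one modality is required")
    k = rng.randint(1, len(available))  # subset size, inclusive bounds
    return sorted(rng.sample(available, k))


def fuse(embeddings, subset):
    """Average the embeddings of the chosen modalities.

    Averaging is a placeholder for the Transformer-based fusion module.
    """
    dim = len(next(iter(embeddings.values())))
    fused = [0.0] * dim
    for name in subset:
        for i, value in enumerate(embeddings[name]):
            fused[i] += value / len(subset)
    return fused


# Toy per-modality embeddings for one location (hypothetical values).
rng = random.Random(0)
embeddings = {"street_view": [1.0, 0.0], "poi": [0.0, 1.0]}

subset = sample_modality_subset(embeddings.keys(), rng)
fused = fuse(embeddings, subset)
```

Because `fuse` accepts any non-empty subset, the same call works whether a location has one modality or all of them, which is the property the abstract highlights for both pretraining and inference.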