🗺️🔍 OVI-MAP

Open-Vocabulary Instance-Semantic Mapping

CVPR 2026 Highlight✨

¹ETH Zurich · ²Google · ³University of Zurich · ⁴TU Munich · ⁵Microsoft
† Equal contribution.

Given a streaming RGB-D sequence with camera poses, OVI-MAP incrementally reconstructs a volumetric 3D scene while maintaining a class-agnostic instance map. Semantic features for the instances are then aggregated incrementally in a zero-shot manner using selectively chosen views, enabling open-set object recognition. Our method supports real-time, open-world scene reconstruction with instance-level semantic understanding.

Abstract

Incremental open-vocabulary 3D instance-semantic mapping is essential for autonomous agents operating in complex everyday environments. However, it remains challenging due to the need for robust instance segmentation, real-time processing, and flexible open-set reasoning. Existing methods often rely on the closed-set assumption or dense per-pixel language fusion, which limits scalability and temporal consistency. We introduce OVI-MAP that decouples instance reconstruction from semantic inference. We propose to build a class-agnostic 3D instance map that is incrementally constructed from RGB-D input, while semantic features are extracted only from a small set of automatically selected views using vision-language models. This design enables stable instance tracking and zero-shot semantic labeling throughout online exploration. Our system operates in real time and outperforms state-of-the-art open-vocabulary mapping baselines on standard benchmarks.

Instance Map

Interactive visualization of the reconstructed instance map. Each instance is assigned a unique color, and the same instance maintains consistent coloring across frames.

Semantic Map

Interactive visualization of the incrementally aggregated semantic map. Each instance in the instance map is colored according to its predicted semantic category.

System Overview

Part A & B: Class-Agnostic Instance Map Reconstruction

Part C & D: Incremental Semantic Feature Aggregation

Incremental Semantic Mapping

Top Figure: The left side shows the pixel-counting strategy, which prioritizes frames with a larger object mask area and often yields redundant front-facing views.
The right side depicts our proposed object-centric view coverage method, which maintains a spherical map of the explored viewing directions and selects frames that provide novel perspectives of the object.
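The view-coverage idea above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the bin resolution, the spherical parameterization, and the selection rule are all assumptions made here for clarity. Viewing directions from the camera to the object are bucketed into coarse azimuth/elevation bins, and a frame is selected only if it lands in a bin that has not been covered yet.

```python
import numpy as np

class ViewCoverage:
    """Spherical map of explored viewing directions for one object instance.

    Hypothetical sketch: bins the camera-to-object direction into a coarse
    elevation x azimuth grid and accepts only views that hit an uncovered bin.
    """

    def __init__(self, n_az=8, n_el=4):
        self.n_az, self.n_el = n_az, n_el
        self.covered = np.zeros((n_el, n_az), dtype=bool)

    def _bin(self, direction):
        d = direction / np.linalg.norm(direction)
        az = np.arctan2(d[1], d[0])             # azimuth in [-pi, pi]
        el = np.arcsin(np.clip(d[2], -1, 1))    # elevation in [-pi/2, pi/2]
        i = min(int((el + np.pi / 2) / np.pi * self.n_el), self.n_el - 1)
        j = min(int((az + np.pi) / (2 * np.pi) * self.n_az), self.n_az - 1)
        return i, j

    def select(self, cam_pos, obj_center):
        """Return True if this view offers a novel perspective of the object."""
        i, j = self._bin(np.asarray(cam_pos, float) - np.asarray(obj_center, float))
        if self.covered[i, j]:
            return False          # direction already explored: skip the VLM query
        self.covered[i, j] = True
        return True
```

Under this scheme, many near-identical front-facing frames collapse into one bin and trigger a single VLM query, while a genuinely new vantage point (e.g. a top-down view) opens a fresh bin and is selected.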

Right Figure: We compare our incremental semantic aggregation using the view coverage strategy against pixel-counting and other baselines. Our method achieves comparable semantic accuracy while requiring significantly fewer VLM queries per instance (bottom-right figure), demonstrating efficient and scalable open-vocabulary mapping during online exploration.
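One simple way to realize incremental aggregation over the selected views is a running mean of normalized per-view features, so no per-frame features need to be stored. The class below is a hedged sketch under that assumption; the paper's actual fusion rule may differ.

```python
import numpy as np

class InstanceFeature:
    """Running aggregate of per-view semantic features for one instance.

    Hypothetical sketch: each selected view contributes one VLM feature,
    fused by an L2-normalized running mean.
    """

    def __init__(self, dim):
        self.sum = np.zeros(dim)
        self.count = 0

    def update(self, feat):
        f = np.asarray(feat, float)
        f = f / (np.linalg.norm(f) + 1e-8)   # normalize each view feature
        self.sum += f
        self.count += 1

    @property
    def embedding(self):
        e = self.sum / max(self.count, 1)
        return e / (np.linalg.norm(e) + 1e-8)  # unit-norm instance embedding
```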

Visualizations

Instance Maps

Colors are randomly assigned in all instance maps according to the instance labels. Gray regions indicate unobserved areas for online methods (Ours and OVO-SLAM) and unlabeled regions for offline methods (Segment3D, Mask3D).

Qualitative comparison of instance maps - Replica dataset.

Qualitative comparison of instance maps - ScanNet dataset.

Evaluated Semantic Maps (Colors from the Dataset)

Qualitative comparison of semantic maps - Replica dataset.

Qualitative comparison of semantic maps - ScanNet dataset (200 categories, colors not listed).

Open-Vocabulary Querying 🔎

Heat Maps 🌡️

Below are heat maps for semantic queries against the reconstructed scenes, computed from the constructed semantic codebook. Colors closer to red indicate that an instance is more similar to the query label, while colors closer to blue indicate lower similarity. Unobserved areas are shown in black.
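The red-to-blue coloring can be sketched as a simple linear colormap over the similarity score. The clipping range and the exact colormap used in the figures are assumptions made here for illustration.

```python
import numpy as np

def heat_color(sim, lo=0.0, hi=1.0):
    """Map a cosine similarity in [lo, hi] to an RGB heat color.

    Hypothetical sketch: blue (dissimilar) blends linearly to red (similar).
    """
    t = float(np.clip((sim - lo) / (hi - lo), 0.0, 1.0))
    return np.array([t, 0.0, 1.0 - t])   # RGB in [0, 1]
```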

Replica dataset.

ScanNet dataset.

Instance Highlighting 💡

Given natural language prompts, our system retrieves and highlights corresponding 3D instances based on the learned vision-language embeddings. Darker tones indicate higher cosine similarity between an object and the query.
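The retrieval step described above amounts to ranking instances by cosine similarity between their aggregated embeddings and the text-query embedding. The function below is a minimal sketch of that ranking; the function name and interface are hypothetical, and obtaining the text embedding from a vision-language model is assumed to happen elsewhere.

```python
import numpy as np

def query_instances(instance_embs, query_emb, top_k=1):
    """Rank instances by cosine similarity to a text-query embedding.

    instance_embs: (N, D) array of per-instance semantic embeddings.
    query_emb:     (D,) text embedding for the natural-language prompt.
    Returns the indices and similarities of the top_k matches.
    """
    E = np.asarray(instance_embs, float)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    q = np.asarray(query_emb, float)
    q = q / np.linalg.norm(q)
    sims = E @ q                          # cosine similarity per instance
    order = np.argsort(-sims)[:top_k]
    return order, sims[order]
```

Instances returned by this ranking are the ones highlighted in the visualizations, with tone intensity proportional to the similarity score.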

Instance highlighting from arbitrary text queries.

BibTeX


      @misc{deng2026ovimap,
        title={OVI-MAP: Open-Vocabulary Instance-Semantic Mapping}, 
        author={Zilong Deng and Federico Tombari and Marc Pollefeys and Johanna Wald and Daniel Barath},
        year={2026},
        eprint={2603.26541},
        archivePrefix={arXiv},
        primaryClass={cs.CV},
        url={https://arxiv.org/abs/2603.26541}, 
      }