3D Gaussian Splatting (3DGS) has recently achieved impressive performance on indoor surface reconstruction and open-vocabulary segmentation. This paper presents GLS, a unified framework for surface reconstruction and open-vocabulary segmentation based on 3DGS. GLS advances both tasks by exploring the correlation between them. For indoor surface reconstruction, we introduce a surface normal prior as a geometric cue to guide the rendered normal, and use the normal error to optimize the rendered depth. For open-vocabulary segmentation, we employ 2D CLIP features to guide instance features and use DEVA masks to enhance their view consistency. Extensive experiments demonstrate the effectiveness of jointly optimizing surface reconstruction and open-vocabulary segmentation: GLS surpasses state-of-the-art approaches for each task on the MuSHRoom, ScanNet++, and LERF-OVS datasets.
Given multi-view RGB images captured by a camera in an indoor scene, our goal is to jointly reconstruct the scene surface and segment objects specified by open-vocabulary text. To achieve this goal, we introduce GLS, a novel framework based on 3DGS. As shown in the pipeline figure, our framework consists of three procedures. In the input procedure, we use the generalizable models SAM, DEVA, and CLIP to produce consistent 2D semantic masks and object-level features, and we adopt a generalizable surface normal estimation model to acquire the geometric cue. In the optimization procedure, we use the semantic and normal priors for regularization. We first follow previous approaches to regularize the rendered color, depth, and semantic feature. We then propose a novel smoothness term to handle texture-less regions and a novel constraint, derived from analyzing the normal error of Gaussians, to refine object structures. In the inference procedure, our model reconstructs the indoor surface and simultaneously selects the target object from an open-vocabulary text query.
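The two priors described above can be illustrated with a minimal NumPy sketch. It is not the GLS implementation. The L1-plus-cosine form of the normal term and the argmax selection rule are common choices assumed here for illustration, and the function names `normal_prior_loss` and `select_instance` are hypothetical. The first function compares a rendered normal map with a prior normal map. The second picks the instance whose feature is most similar to a text embedding, standing in for a CLIP text feature.

```python
import numpy as np


def normalize(v, eps=1e-8):
    """Scale vectors along the last axis to (approximately) unit length."""
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)


def normal_prior_loss(rendered, prior):
    """Assumed form: mean L1 distance plus mean (1 - cosine) between
    two (H, W, 3) normal maps. Not taken from the GLS paper."""
    r, p = normalize(rendered), normalize(prior)
    l1 = np.abs(r - p).sum(axis=-1).mean()
    cos = (1.0 - (r * p).sum(axis=-1)).mean()
    return l1 + cos


def select_instance(text_feat, instance_feats):
    """Return the index of the instance feature with the highest cosine
    similarity to the text feature, plus all similarities.
    text_feat: (D,), instance_feats: (K, D). Hypothetical selection rule."""
    sims = normalize(instance_feats) @ normalize(text_feat)
    return int(np.argmax(sims)), sims


# Toy usage with random normals and hand-made features.
rng = np.random.default_rng(0)
prior = normalize(rng.normal(size=(4, 4, 3)))
loss_same = normal_prior_loss(prior, prior)      # near zero for identical maps
loss_flipped = normal_prior_loss(-prior, prior)  # large for flipped normals

text = np.array([1.0, 0.0, 0.0])
instances = np.array([[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
best, sims = select_instance(text, instances)
```

In a real pipeline the normal term would be computed on differentiable tensors during optimization, and the text and instance features would come from CLIP rather than hand-made vectors.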
Scene editing effects
We develop a visualization tool based on nerfview and GPT-4V to explore potential applications in embodied intelligence.
Prompt: Left: "I want to make a piece of toast" Right: "I want to cut a tomato"
Prompt: Left: "I want to wash my hands" Right: "I want to store my ice cream"
More results
Comparison with previous methods on indoor surface reconstruction.
Comparison with previous methods on 3D open-vocabulary segmentation (dataset: LERF-OVS).
@article{qiu2024gls,
title={GLS: Geometry-aware 3D Language Gaussian Splatting},
author={Qiu, Jiaxiong and Liu, Liu and Su, Zhizhong and Lin, Tianwei},
journal={arXiv preprint arXiv:2411.18066},
year={2024}
}