3D Gaussian Splatting (3DGS) has emerged as a powerful representation for neural scene reconstruction, offering high-quality novel view synthesis while maintaining computational efficiency. In this paper, we extend the capabilities of 3DGS beyond pure scene representation by introducing OpenSplat3D, an approach for open-vocabulary 3D instance segmentation that requires no manual labeling. Our method leverages feature-splatting techniques to associate semantic information with individual Gaussians, enabling fine-grained scene understanding. We use instance masks from the Segment Anything Model (SAM), together with a contrastive loss formulation, as guidance for the instance features to achieve accurate instance-level segmentation. Furthermore, we utilize language embeddings from a vision-language model, allowing for flexible, text-driven instance identification. This combination enables our system to identify and segment arbitrary objects in 3D scenes based on natural language descriptions. We show results on LERF-mask and LERF-OVS as well as the full ScanNet++ validation set, demonstrating the effectiveness of our approach.
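To make the idea of contrastive guidance from 2D masks more concrete, the following is a minimal sketch, not the loss used in the paper. It assumes a supervised-contrastive-style objective over a handful of rendered per-pixel feature vectors, where pixels in the same mask are pulled together and pixels in different masks are pushed apart. The function name `contrastive_mask_loss`, the `temperature` value, and the sampling of pixels are all illustrative assumptions.

```python
import math


def contrastive_mask_loss(features, mask_ids, temperature=0.1):
    """Supervised-contrastive-style loss (illustrative assumption).

    features: list of equal-length float vectors, one per sampled pixel.
    mask_ids: list of ints, the 2D mask each sampled pixel belongs to.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    feats = [normalize(f) for f in features]
    n = len(feats)
    total, anchors = 0.0, 0
    for i in range(n):
        # Temperature-scaled cosine similarity to every other pixel.
        sims = {
            j: sum(a * b for a, b in zip(feats[i], feats[j])) / temperature
            for j in range(n) if j != i
        }
        positives = [j for j in sims if mask_ids[j] == mask_ids[i]]
        if not positives:
            continue  # this anchor has no same-mask partner
        log_denom = math.log(sum(math.exp(s) for s in sims.values()))
        # Average negative log-probability of picking a same-mask pixel.
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy check: features that agree with the masks give a lower loss.
separated = contrastive_mask_loss([[1, 0], [1, 0], [0, 1], [0, 1]], [0, 0, 1, 1])
mixed = contrastive_mask_loss([[1, 0], [0, 1], [1, 0], [0, 1]], [0, 0, 1, 1])
print(separated < mixed)
```

In a real pipeline the features would be rendered from the Gaussians and the loss would be computed with a differentiable tensor library; plain Python is used here only to keep the sketch self-contained.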
Overview of our proposed pipeline. On the left are the training inputs: posed RGB images, a coarse SfM point cloud for initialization, and the extracted SAM masks. The middle section illustrates the instance learning with a Gaussian feature field optimization, as well as the clustering used to obtain coherent 3D instances. On the right, we demonstrate the language integration: the top-k informative views are identified per instance, hierarchical crops are constructed, and finally the language embedding per instance is computed.
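Once each 3D instance carries a single language embedding, a text query can be matched against the instances by embedding similarity. The snippet below is a hedged sketch of that last step only. It assumes the instance and text embeddings already live in the same vision-language space and uses cosine similarity, which is a common choice but not confirmed by the text above. The names `cosine` and `select_instances`, the optional `threshold`, and the toy three-dimensional embeddings are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_instances(instance_embeddings, text_embedding, threshold=None):
    """Return the ids of instances matching a text query, plus all scores.

    instance_embeddings: dict mapping instance id -> embedding vector.
    threshold: if None, return only the single best match (assumption);
               otherwise return every instance scoring at least threshold.
    """
    scores = {
        iid: cosine(emb, text_embedding)
        for iid, emb in instance_embeddings.items()
    }
    if not scores:
        return [], scores
    if threshold is None:
        return [max(scores, key=scores.get)], scores
    return sorted(i for i, s in scores.items() if s >= threshold), scores


# Toy embeddings standing in for real vision-language features.
instances = {0: [1.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0], 2: [0.7, 0.7, 0.0]}
query = [0.9, 0.1, 0.0]
best, _ = select_instances(instances, query)
print(best)
```

Returning the full score dictionary alongside the selection makes it easy to inspect near misses or to experiment with different thresholds.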
Method | figurines mIoU | figurines mBIoU | ramen mIoU | ramen mBIoU | teatime mIoU | teatime mBIoU | mean mIoU | mean mBIoU |
---|---|---|---|---|---|---|---|---|
LERF* | 33.5 | 30.6 | 28.3 | 14.7 | 49.7 | 42.6 | 37.2 | 29.3 |
LangSplat* | 52.8 | 50.5 | 50.4 | 44.7 | 69.5 | 65.6 | 57.6 | 53.6 |
Gaussian Grouping* | 69.7 | 67.9 | 77.0 | 68.7 | 71.7 | 66.1 | 72.8 | 67.6 |
CGC | 91.6 | 88.8 | 68.7 | 63.1 | 80.5 | 78.9 | 80.3 | 76.9 |
OpenSplat3D (Ours) | 92.3 | 89.4 | 75.9 | 68.2 | 83.7 | 78.8 | 84.0 | 78.8 |
Semantic segmentation results on the LERF-mask dataset. We report the mean IoU (mIoU) and mean boundary IoU (mBIoU) for each scene and the overall average. Our method achieves the best mean performance on both metrics. Gaussian Grouping performs slightly better on the ramen scene, and CGC is marginally ahead in mBIoU on teatime (78.9 vs. 78.8). *: Results as reported in the Gaussian Grouping paper.
Method | figurines mIoU | figurines mAcc. | ramen mIoU | ramen mAcc. | teatime mIoU | teatime mAcc. | waldo_kitchen mIoU | waldo_kitchen mAcc. | mean mIoU | mean mAcc. |
---|---|---|---|---|---|---|---|---|---|---|
LangSplat | 10.16 | 8.93 | 7.92 | 11.27 | 11.38 | 20.34 | 9.18 | 9.09 | 9.66 | 12.41 |
LEGaussians | 17.99 | 23.21 | 15.79 | 26.76 | 19.27 | 27.12 | 11.78 | 18.18 | 16.21 | 23.82 |
OpenGaussian | 39.29 | 55.36 | 31.01 | 42.25 | 60.44 | 76.27 | 22.70 | 31.82 | 38.36 | 51.43 |
OpenSplat3D (Ours) | 60.71 | 85.71 | 49.20 | 76.06 | 73.27 | 88.14 | 55.63 | 77.27 | 59.70 | 81.79 |
LERF-OVS 3D object selection evaluation from textual queries. Following OpenGaussian, only the Gaussians responding to the query are rendered, so the rendering does not respect occlusion by other objects in the scene. Accuracy is reported as mAcc@0.25. Note that OpenGaussian fine-tunes parameters per scene for best results.
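For readers unfamiliar with the two reported metrics, the sketch below shows one plausible way to compute them over a set of text queries. It assumes that mAcc@0.25 is the fraction of queries whose predicted mask has an IoU strictly above 0.25 with the ground truth; the exact comparison and averaging used by the benchmark may differ. Masks are represented as sets of pixel indices, and the names `iou` and `miou_and_macc` are hypothetical.

```python
def iou(pred, gt):
    """Intersection over union of two masks given as collections of pixel ids."""
    pred, gt = set(pred), set(gt)
    union = pred | gt
    return len(pred & gt) / len(union) if union else 0.0


def miou_and_macc(predictions, ground_truths, iou_threshold=0.25):
    """Mean IoU and thresholded accuracy over paired query results.

    Assumption: a query counts as correct when its IoU exceeds iou_threshold.
    """
    ious = [iou(p, g) for p, g in zip(predictions, ground_truths)]
    if not ious:
        return 0.0, 0.0
    miou = sum(ious) / len(ious)
    macc = sum(v > iou_threshold for v in ious) / len(ious)
    return miou, macc


# Three toy queries: a perfect match, a weak overlap, and a miss.
preds = [{1, 2, 3, 4}, {1}, {5, 6}]
gts = [{1, 2, 3, 4}, set(range(1, 11)), {7, 8}]
miou, macc = miou_and_macc(preds, gts)
print(round(miou, 4), round(macc, 4))
```

The thresholded accuracy is less sensitive to mask boundary quality than mIoU, which is why the two numbers in the table can diverge noticeably for the same method.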
Method | AP (without post-processing) | AP50 (without post-processing) | AP25 (without post-processing) | AP (with post-processing) | AP50 (with post-processing) | AP25 (with post-processing) |
---|---|---|---|---|---|---|
SAM3D | 3.9 | 9.3 | 22.1 | 8.4 | 16.1 | 30.0 |
Segment3D | 13.0 | 23.8 | 38.3 | 20.2 | 30.9 | 42.7 |
Open3DIS* | - | - | - | 20.7 | 38.6 | 47.1 |
OpenSplat3D (Ours) | 19.2 | 37.3 | 56.2 | 24.5 | 41.7 | 57.1 |
Class-agnostic instance segmentation on the ScanNet++ validation split. *: Open3DIS uses superpoints produced by Felzenszwalb and Huttenlocher segmentation directly in its pipeline.
Method | Setting | AP | AP50 | AP25 |
---|---|---|---|---|
SGIFormer | fully-supervised | 23.9 | 37.5 | 46.6 |
Mask3D (+ OpenMask3D) | open-vocabulary | - | 15.0 | - |
Segment3D (+ OpenMask3D) | open-vocabulary | - | 18.5 | - |
OpenSplat3D (Ours) | open-vocabulary | 16.5 | 29.7 | 39.0 |
Instance segmentation on the ScanNet++ validation split. Our method not only outperforms the other open-vocabulary methods by a large margin but also narrows the gap to the state-of-the-art fully-supervised SGIFormer approach.
@InProceedings{piekenbrinck2025opensplat3d,
title = {{OpenSplat3D: Open-Vocabulary 3D Instance Segmentation using Gaussian Splatting}},
author = {Piekenbrinck, Jens and Schmidt, Christian and Hermans, Alexander and Vaskevicius, Narunas and Linder, Timm and Leibe, Bastian},
booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference},
pages = {5246--5255},
year = {2025}
}