SLGaussian: Fast Language Gaussian Splatting in Sparse Views

Kangjie Chen1*, Bingquan Dai1*, Minghan Qin1, Dongbin Zhang1, Peihao Li1, Yingshuang Zou1, Haoqian Wang1† (* equal contribution, † corresponding author)
1Tsinghua Shenzhen International Graduate School, Tsinghua University
Accepted to ACM MM 2025

With just two RGB views, our method infers a 3D semantic field in under 30 seconds without per-scene optimization. On the LERF and 3D-OVS datasets (image resolution 416 × 576), querying takes 0.011 seconds per query, and our method outperforms existing approaches in both speed and IoU.

Abstract

3D semantic field learning is crucial for applications like autonomous navigation, AR/VR, and robotics, where accurate comprehension of 3D scenes from limited viewpoints is essential. Existing methods struggle under sparse-view conditions, relying on inefficient per-scene multi-view optimizations that are impractical for many real-world tasks. To address this, we propose SLGaussian, a feed-forward method for constructing 3D semantic fields from sparse viewpoints, allowing 3DGS-based scenes to be inferred directly. By ensuring consistent SAM segmentations through video tracking and using low-dimensional indexing for high-dimensional CLIP features, SLGaussian efficiently embeds language information in 3D space, offering a robust solution for accurate 3D scene understanding under sparse-view conditions. In two-view experiments on 3D object querying and segmentation on the LERF and 3D-OVS datasets, SLGaussian outperforms existing methods in chosen IoU, Localization Accuracy, and mIoU. Moreover, it infers a scene in under 30 seconds and answers open-vocabulary queries in just 0.011 seconds each.
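
The query speed follows from the decoupling described above: each Gaussian carries only a low-dimensional index into a small table of high-dimensional CLIP features, so an open-vocabulary query compares the text embedding against that table rather than against every Gaussian. Below is a minimal PyTorch sketch of this indexing idea, not the released implementation; the codebook size K, the per-Gaussian index tensor, and the query helper are illustrative assumptions, and random tensors stand in for the actual CLIP embeddings.

import torch
import torch.nn.functional as F

# Illustrative sizes (assumptions): N Gaussians, K tracked segments,
# D-dimensional CLIP features.
N, K, D = 100_000, 64, 512

# Small table ("codebook") of high-dimensional, segment-level CLIP features.
codebook = F.normalize(torch.randn(K, D), dim=-1)
# Each Gaussian stores only a low-dimensional index into that table.
gaussian_idx = torch.randint(0, K, (N,))

def query(text_feat: torch.Tensor) -> torch.Tensor:
    """Per-Gaussian relevance for one CLIP text embedding (hypothetical helper)."""
    text_feat = F.normalize(text_feat, dim=-1)
    sims = codebook @ text_feat      # (K,) cosine similarity, one score per segment
    return sims[gaussian_idx]        # broadcast segment scores to all N Gaussians

scores = query(torch.randn(D))       # threshold or argmax to localize the queried object

Because each query touches only the K-entry table instead of N per-Gaussian CLIP vectors, per-query cost is essentially independent of scene size, which is consistent with the reported 0.011 seconds per query.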

3D semantic field video results on test scenes from the RealEstate10K dataset, using only two sparse-view RGB images as input.

Using our model pre-trained on the RealEstate10K dataset, we ran inference on scenes from the LERF dataset, using only two sparse-view RGB images (the first and last frames) as input.

Visual comparison on LERF and 3D-OVS datasets.

Qualitative comparisons of open-vocabulary 3D object localization on the LERF and 3D-OVS datasets. The top row displays scenes from the LERF dataset, while the bottom row shows scenes from the 3D-OVS dataset. Red points indicate the model predictions, and black dashed bounding boxes denote the ground-truth annotations.

BibTeX

@article{chen2024slgaussian,
  title={{SLGaussian}: Fast Language Gaussian Splatting in Sparse Views},
  author={Chen, Kangjie and Dai, Bingquan and Qin, Minghan and Zhang, Dongbin and Li, Peihao and Zou, Yingshuang and Wang, Haoqian},
  journal={arXiv preprint arXiv:2412.08331},
  year={2024}
}