Tackling View-Dependent Semantics in 3D Language Gaussian Splatting

Authors: Jiazhong Cen, Xudong Zhou, Jiemin Fang, Changsong Wen, Lingxi Xie, Xiaopeng Zhang, Wei Shen, Qi Tian

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that LaGa effectively captures key information from view-dependent semantics, enabling a more comprehensive understanding of 3D scenes. Notably, under the same settings, LaGa achieves a significant improvement of +18.7% mIoU over the previous SOTA on the LERF-OVS dataset. Our code is available at: https://github.com/SJTU-DeepVisionLab/LaGa.
Researcher Affiliation | Collaboration | 1MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University; 2Huawei Technologies Co., Ltd. Correspondence to: Wei Shen <EMAIL>, Jiemin Fang <EMAIL>.
Pseudocode | Yes | B.4. Detailed Algorithm of the Cross-View Descriptor Extraction: the pseudocode of the cross-view descriptor extraction is shown in Algorithm 1.
Open Source Code | Yes | Our code is available at: https://github.com/SJTU-DeepVisionLab/LaGa.
Open Datasets | Yes | We evaluate LaGa on LERF-OVS (Kerr et al., 2023; Qin et al., 2024), 3D-OVS (Liu et al., 2023a), and ScanNet (Dai et al., 2017). LERF-OVS consists of complex 360° indoor scenes, while 3D-OVS features forward-facing scenes with long-tailed categories. Both datasets provide 2D annotations.
Dataset Splits | No | The paper mentions training on a 'training set I' and evaluates on datasets such as LERF-OVS, 3D-OVS, and ScanNet. However, it does not explicitly provide training/test/validation split percentages, sample counts, or references to predefined splits for these datasets.
Hardware Specification | Yes | All experiments are conducted on a single NVIDIA RTX 3090 GPU.
Software Dependencies | No | The paper mentions using the "ViT-H model of SAM and the OpenCLIP ViT-B/16 model of CLIP" but does not specify software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x).
Experiment Setup | Yes | For each scene, the 3D-GS model is trained for 30,000 iterations, followed by 30,000 iterations of training the Gaussian affinity features. For ScanNet, we apply a KNN-based local feature smoothing operation following SAGA (Cen et al., 2025a) while training the affinity features. During inference, in addition to the relevance score, we find that applying an auxiliary cosine similarity threshold (0.23) helps remove unwanted regions. For all remaining objects in the scene, relevance scores are first min-max normalized. A 3D bilateral filtering step is then applied to the resulting 3D relevance map to suppress noise. Gaussians with relevance scores above 0.6 are classified as foreground.
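The inference-time post-processing described in the row above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `select_foreground` and the assumption that per-Gaussian relevance scores and auxiliary CLIP cosine similarities arrive as NumPy arrays are hypothetical, and the 3D bilateral filtering step is omitted because it requires the Gaussians' spatial neighborhood structure.

```python
import numpy as np

def select_foreground(relevance, cos_sim, cos_thresh=0.23, fg_thresh=0.6):
    """Sketch of the described post-processing (illustrative, not the paper's API).

    relevance: (N,) per-Gaussian relevance scores for a text query.
    cos_sim:   (N,) auxiliary cosine similarities for the same query.
    Returns a boolean (N,) foreground mask.
    """
    # Auxiliary cosine-similarity threshold (0.23) removes unwanted regions.
    keep = cos_sim > cos_thresh
    scores = np.where(keep, relevance, 0.0)

    # Min-max normalize the surviving relevance scores to [0, 1].
    lo, hi = scores.min(), scores.max()
    scores = (scores - lo) / (hi - lo + 1e-8)

    # (The paper additionally applies 3D bilateral filtering to the relevance
    # map here to suppress noise; omitted in this sketch.)

    # Gaussians above the relevance threshold (0.6) are classified foreground.
    return scores > fg_thresh
```

For example, a Gaussian with high relevance but a cosine similarity below 0.23 is zeroed out before normalization, so it cannot pass the 0.6 foreground threshold.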