Segment Any 3D Object with Language

Authors: Seungjun Lee, Yuyang Zhao, Gim Hee Lee

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type Experimental Our SOLE outperforms previous methods by a large margin on ScanNetv2, ScanNet200, and Replica benchmarks, and the results are even close to the fully-supervised counterpart despite the absence of class annotations during training. Furthermore, extensive qualitative results demonstrate the versatility of our SOLE to language instructions.
Researcher Affiliation Academia Seungjun Lee, Yuyang Zhao, Gim Hee Lee, Department of Computer Science, National University of Singapore
Pseudocode No The paper describes the methodology in prose and uses diagrams (e.g., Figure 3 for the overall framework) but does not include any explicit pseudocode or algorithm blocks.
Open Source Code No To ensure the reproducibility of our work, we state specific implementation details in Sec. 4.1 and Appendix Sec. A. The code will be made publicly available.
Open Datasets Yes We evaluate SOLE on the popular scene understanding datasets: ScanNetv2 (Dai et al., 2017), ScanNet200 (Rozenberszki et al., 2022) and Replica (Straub et al., 2019) in both closed-set and open-set 3D instance segmentation tasks.
Dataset Splits Yes For ScanNet200, both models are trained with mask annotations in ScanNetv2 (Dai et al., 2017). Following (Takmaz et al., 2023), 53 classes that are semantically close to ScanNet are grouped as Base. The remaining 147 classes are grouped as Novel.
Hardware Specification No The paper does not provide specific hardware details (e.g., GPU/CPU models) used for running its experiments.
Software Dependencies Yes To effectively generate a caption for each mask, we use a caption model in CLIP space, i.e., DeCap (Li et al., 2023). DeCap is a lightweight transformer model to generate captions from CLIP image embeddings. It contains a 4-layer Transformer with 4 attention heads as the language model, and the visual embedding is obtained from the pre-trained ViT-L/32 CLIP model. We feed the mask features that are average-pooled from the projected CLIP visual features into the DeCap model to obtain the mask caption. Then the caption is integrated into the text prompt "a {} in a scene" to better align with our data, e.g., "a blue chair in a scene". With the mask caption, noun phrases are extracted by the NLP libraries TextBlob (Loria et al., 2018) and spaCy (Honnibal & Montani, 2017) to get the mask-entity association.
Experiment Setup Yes Our model is trained for 600 epochs with the AdamW (Loshchilov & Hutter, 2017) optimizer. The learning rate is set to 1e-4 with cyclical decay. In training, we set λ_MMA = 20.0, λ_dice = 2.0 and λ_BCE = 5.0 as the loss weights.
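The Software Dependencies row describes a two-step caption pipeline: projected CLIP visual features are average-pooled over each instance mask, captioned with DeCap, and the caption is slotted into the template "a {} in a scene". A minimal pure-Python sketch of the pooling and templating steps, with toy feature lists standing in for the CLIP/DeCap models (function names and dimensions are illustrative assumptions, not the paper's code):

```python
def mask_feature(point_features, mask):
    """Average-pool per-point features over a binary instance mask.

    point_features: list of per-point feature vectors (lists of floats),
    standing in for projected CLIP visual features.
    mask: binary list selecting the points belonging to one instance.
    """
    selected = [f for f, m in zip(point_features, mask) if m]
    n = len(selected)
    return [sum(col) / n for col in zip(*selected)]

def caption_to_prompt(caption: str) -> str:
    """Insert a generated mask caption into the paper's template 'a {} in a scene'."""
    return f"a {caption} in a scene"

# Toy example: 5 points with 4-dim features; the mask selects points 0 and 3.
feats = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11],
         [12, 13, 14, 15], [16, 17, 18, 19]]
mask = [1, 0, 0, 1, 0]
pooled = mask_feature(feats, mask)              # mean of rows 0 and 3
prompt = caption_to_prompt("blue chair")        # "a blue chair in a scene"
```

In the paper's pipeline the pooled feature would be fed to DeCap to produce the caption, and noun phrases would then be extracted from the caption with TextBlob or spaCy; those model calls are omitted here.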
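The Experiment Setup row implies the training objective is a weighted sum of three losses. A sketch of how the reported weights would combine the per-component losses; the helper name and plain-float interface are assumptions for illustration, not the paper's implementation:

```python
# Loss weights reported in the checklist row above.
LAMBDA_MMA = 20.0   # weight for the multimodal association (MMA) loss
LAMBDA_DICE = 2.0   # weight for the Dice mask loss
LAMBDA_BCE = 5.0    # weight for the binary cross-entropy mask loss

def total_loss(l_mma: float, l_dice: float, l_bce: float) -> float:
    """Weighted sum of the three loss terms (hypothetical helper)."""
    return LAMBDA_MMA * l_mma + LAMBDA_DICE * l_dice + LAMBDA_BCE * l_bce

# e.g. with all unit losses: 20.0 + 2.0 + 5.0 = 27.0
loss = total_loss(1.0, 1.0, 1.0)
```

With these weights, the MMA term dominates the gradient signal by design, which is consistent with the paper's emphasis on language alignment over pure mask quality.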