Segment Any 3D Object with Language

Authors: Seungjun Lee, Yuyang Zhao, Gim Hee Lee

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type Experimental Our SOLE outperforms previous methods by a large margin on ScanNetv2, ScanNet200, and Replica benchmarks, and the results are even close to the fully-supervised counterpart despite the absence of class annotations during training. Furthermore, extensive qualitative results demonstrate the versatility of our SOLE to language instructions.
Researcher Affiliation Academia Seungjun Lee, Yuyang Zhao, Gim Hee Lee, Department of Computer Science, National University of Singapore
Pseudocode No The paper describes the methodology in prose and uses diagrams (e.g., Figure 3 for the overall framework) but does not include any explicit pseudocode or algorithm blocks.
Open Source Code No To ensure the reproducibility of our work, we state specific implementation details in Sec. 4.1 and Appendix Sec. A. The code will be made publicly available.
Open Datasets Yes We evaluate SOLE on the popular scene understanding datasets: ScanNetv2 (Dai et al., 2017), ScanNet200 (Rozenberszki et al., 2022) and Replica (Straub et al., 2019) in both closed-set and open-set 3D instance segmentation tasks.
Dataset Splits Yes For ScanNet200, both models are trained with mask annotations in ScanNetv2 (Dai et al., 2017). Following (Takmaz et al., 2023), 53 classes that are semantically close to ScanNet are grouped as Base. The remaining 147 classes are grouped as Novel.
Hardware Specification No The paper does not provide specific hardware details (e.g., GPU/CPU models) used for running its experiments.
Software Dependencies Yes To effectively generate a caption for each mask, we use a caption model in CLIP space, i.e., DeCap (Li et al., 2023). DeCap is a lightweight transformer model to generate captions from CLIP image embeddings. It contains a 4-layer Transformer with 4 attention heads as the language model, and the visual embedding is obtained from the pre-trained ViT-L/32 CLIP model. We feed the mask features that are average-pooled from the projected CLIP visual features into the DeCap model to obtain the mask caption. Then the caption is integrated into the text prompt "a {} in a scene" to better align with our data, e.g., "a blue chair in a scene". With the mask caption, noun phrases are extracted by the NLP libraries TextBlob (Loria et al., 2018) and spaCy (Honnibal & Montani, 2017) to get the mask-entity association.
Experiment Setup Yes Our model is trained for 600 epochs with the AdamW (Loshchilov & Hutter, 2017) optimizer. The learning rate is set to 1e-4 with cyclical decay. In training, we set λ_MMA = 20.0, λ_dice = 2.0 and λ_BCE = 5.0 as the loss weights.
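The Software Dependencies row describes a two-step caption pipeline: projected CLIP visual features are average-pooled over each instance mask, captioned with DeCap, and the caption is slotted into the template "a {} in a scene". A minimal pure-Python sketch of the pooling and templating steps, with toy feature lists standing in for the CLIP/DeCap models (function names and dimensions are illustrative assumptions, not the paper's code):

```python
def mask_feature(point_features, mask):
    """Average-pool per-point features over a binary instance mask.

    point_features: list of per-point feature vectors (lists of floats),
    standing in for projected CLIP visual features.
    mask: binary list selecting the points belonging to one instance.
    """
    selected = [f for f, m in zip(point_features, mask) if m]
    n = len(selected)
    return [sum(col) / n for col in zip(*selected)]

def caption_to_prompt(caption: str) -> str:
    """Insert a generated mask caption into the paper's template 'a {} in a scene'."""
    return f"a {caption} in a scene"

# Toy example: 5 points with 4-dim features; the mask selects points 0 and 3.
feats = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11],
         [12, 13, 14, 15], [16, 17, 18, 19]]
mask = [1, 0, 0, 1, 0]
pooled = mask_feature(feats, mask)              # mean of rows 0 and 3
prompt = caption_to_prompt("blue chair")        # "a blue chair in a scene"
```

In the paper's pipeline the pooled feature would be fed to DeCap to produce the caption, and noun phrases would then be extracted from the caption with TextBlob or spaCy; those model calls are omitted here.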
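The Experiment Setup row implies the training objective is a weighted sum of three losses. A sketch of how the reported weights would combine the per-component losses; the helper name and plain-float interface are assumptions for illustration, not the paper's implementation:

```python
# Loss weights reported in the checklist row above.
LAMBDA_MMA = 20.0   # weight for the multimodal association (MMA) loss
LAMBDA_DICE = 2.0   # weight for the Dice mask loss
LAMBDA_BCE = 5.0    # weight for the binary cross-entropy mask loss

def total_loss(l_mma: float, l_dice: float, l_bce: float) -> float:
    """Weighted sum of the three loss terms (hypothetical helper)."""
    return LAMBDA_MMA * l_mma + LAMBDA_DICE * l_dice + LAMBDA_BCE * l_bce

# e.g. with all unit losses: 20.0 + 2.0 + 5.0 = 27.0
loss = total_loss(1.0, 1.0, 1.0)
```

With these weights, the MMA term dominates the gradient signal by design, which is consistent with the paper's emphasis on language alignment over pure mask quality.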