LOCATE 3D: Real-World Object Localization via Self-Supervised Learning in 3D

Authors: Paul McVay, Sergio Arnaud, Ada Martin, Arjun Majumdar, Krishna Murthy Jatavallabhula, Phillip Thomas, Ruslan Partsey, Daniel Dugas, Abha Gejji, Alexander Sax, Vincent-Pierre Berges, Mikael Henaff, Ayush Jain, Ang Cao, Ishita Prasad, Mrinal Kalakrishnan, Michael Rabbat, Nicolas Ballas, Mido Assran, Oleksandr Maksymets, Aravind Rajeswaran, Franziska Meier

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental 4. Experiments and Analysis In this section, we report results for our trained models. LOCATE 3D is trained and evaluated on the standard 3D referential grounding benchmarks SR3D, NR3D (Achlioptas et al., 2020), and ScanRefer (Chen et al., 2020). We compare with prior work and two vision-language model (VLM) baselines. The VLM baselines process the RGB-D observations with a modular pipeline composed of three stages. ... We present the overall results in Table 1. ... Section 4.2 analyzes the impact of 3D-JEPA pre-training. Section 4.3 presents ablation studies on various components of our architecture...
Researcher Affiliation Collaboration 1FAIR at Meta 2Carnegie Mellon University 3University of Michigan, Ann Arbor. Correspondence to: Sergio Arnaud <EMAIL>, Paul McVay <EMAIL>.
Pseudocode No The paper describes the model architecture and training procedures in Sections 2 and 3, and Appendix A.1 and B.1 provide further architectural details, but no explicit pseudocode or algorithm blocks are present.
Open Source Code Yes Code, models and dataset can be found at the project website: locate3d.atmeta.com
Open Datasets Yes Additionally, we introduce LOCATE 3D DATASET, a new dataset for 3D referential grounding, spanning multiple capture setups with over 130K annotations. This enables a systematic study of generalization capabilities as well as a stronger model. Code, models and dataset can be found at the project website: locate3d.atmeta.com
Dataset Splits Yes In total, our dataset contains 131,641 samples. Decomposed by scene dataset, L3DD contains: 1. ScanNet: 30,135 new language annotations covering 550 venues and 5,527 objects for training; 4,470 new language annotations covering 130 venues and 1,038 objects for validation. 2. ScanNet++: 91,846 new language annotations covering 230 venues and 13,359 objects for training; 3,774 new language annotations covering 50 venues and 1,303 objects for validation. 3. ARKitScenes: 991 new language annotations covering 293 venues and 1,862 objects, covering scenes used for pretraining; 425 new language annotations covering 93 venues and 460 objects for validation.
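The split counts above can be cross-checked with a few lines of arithmetic. This sketch (not from the paper's code) sums the quoted per-split annotation counts and confirms they reach the stated total of 131,641 samples.

```python
# Sanity check (illustrative, not the authors' code): the per-split
# annotation counts quoted above should sum to the stated 131,641 total.
splits = {
    "ScanNet":     {"train": 30_135, "val": 4_470},
    "ScanNet++":   {"train": 91_846, "val": 3_774},
    "ARKitScenes": {"train": 991,    "val": 425},  # train rows cover pretraining scenes
}

total = sum(n for counts in splits.values() for n in counts.values())
print(total)  # 131641
```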
Hardware Specification Yes With this feature cache, a forward pass of our model takes 1 second for a scene with 100k feature points and utilizes 8 GB of VRAM on an A100 GPU.
Software Dependencies No The paper mentions various models and tools used (e.g., Llama-3, GPT-4o, SAM 2, Grounding DINO, AdamW) but does not provide specific version numbers for these software components or other key libraries like Python, PyTorch, or CUDA.
Experiment Setup Yes LOCATE 3D is optimized using AdamW (Loshchilov and Hutter, 2019) with parameters β1 = 0.9, β2 = 0.999, weight decay of 0.01, and a learning rate scheduler as described in Appendix C.2. We optimize the following loss function: L = λ_dice L_dice + λ_ce L_ce + λ_box L_box + λ_giou L_giou + λ_align L_align, with λ_ce = 4.0 (class weight), λ_mask = 6.0 (mask cross-entropy weight), λ_dice = 4.0 (mask dice weight), λ_box = 1.0 (bounding-box L1 weight), λ_giou = 1.0 (bounding-box GIoU weight).
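As a rough illustration of how the quoted weights combine, the sketch below forms the weighted sum L = Σ_k λ_k · L_k in plain Python. The per-term loss values and the function name are hypothetical, and λ_align is omitted because its value is not quoted in the excerpt.

```python
# Illustrative sketch of the weighted loss combination; the individual loss
# values below are made up, and lambda_align is omitted because its value
# is not given in the paper excerpt above.
LOSS_WEIGHTS = {
    "ce":   4.0,  # class cross-entropy weight
    "mask": 6.0,  # mask cross-entropy weight
    "dice": 4.0,  # mask dice weight
    "box":  1.0,  # bounding-box L1 weight
    "giou": 1.0,  # bounding-box GIoU weight
}

def combined_loss(losses: dict) -> float:
    """Return the weighted sum L = sum_k lambda_k * L_k."""
    return sum(LOSS_WEIGHTS[name] * value for name, value in losses.items())

# Hypothetical per-term values, e.g. from one training step:
example = {"ce": 0.5, "mask": 0.2, "dice": 0.3, "box": 0.1, "giou": 0.05}
print(round(combined_loss(example), 4))  # 4.55
```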