LOCATE 3D: Real-World Object Localization via Self-Supervised Learning in 3D

Authors: Paul McVay, Sergio Arnaud, Ada Martin, Arjun Majumdar, Krishna Murthy Jatavallabhula, Phillip Thomas, Ruslan Partsey, Daniel Dugas, Abha Gejji, Alexander Sax, Vincent-Pierre Berges, Mikael Henaff, Ayush Jain, Ang Cao, Ishita Prasad, Mrinal Kalakrishnan, Michael Rabbat, Nicolas Ballas, Mido Assran, Oleksandr Maksymets, Aravind Rajeswaran, Franziska Meier

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental 4. Experiments and Analysis In this section, we report results for our trained models. LOCATE 3D is trained and evaluated on the standard 3D referential grounding benchmarks SR3D, NR3D (Achlioptas et al., 2020), and ScanRefer (Chen et al., 2020). We compare with prior work and two vision-language model (VLM) baselines. The VLM baselines process the RGB-D observations with a modular pipeline composed of three stages. ... We present the overall results in Table 1. ... Section 4.2 analyzes the impact of 3D-JEPA pre-training. Section 4.3 presents ablation studies on various components of our architecture...
Researcher Affiliation Collaboration 1FAIR at Meta 2Carnegie Mellon University 3University of Michigan, Ann Arbor. Correspondence to: Sergio Arnaud <EMAIL>, Paul McVay <EMAIL>.
Pseudocode No The paper describes the model architecture and training procedures in Sections 2 and 3, and Appendix A.1 and B.1 provide further architectural details, but no explicit pseudocode or algorithm blocks are present.
Open Source Code Yes Code, models and dataset can be found at the project website: locate3d.atmeta.com
Open Datasets Yes Additionally, we introduce LOCATE 3D DATASET, a new dataset for 3D referential grounding, spanning multiple capture setups with over 130K annotations. This enables a systematic study of generalization capabilities as well as a stronger model. Code, models and dataset can be found at the project website: locate3d.atmeta.com
Dataset Splits Yes In total, our dataset contains 131,641 samples. Decomposed by scene dataset, L3DD contains: 1. ScanNet: 30,135 new language annotations covering 550 venues and 5,527 objects for training; 4,470 new language annotations covering 130 venues and 1,038 objects for validation. 2. ScanNet++: 91,846 new language annotations covering 230 venues and 13,359 objects for training; 3,774 new language annotations covering 50 venues and 1,303 objects for validation. 3. ARKitScenes: 991 new language annotations covering 293 venues and 1,862 objects, covering scenes used for pretraining; 425 new language annotations covering 93 venues and 460 objects for validation.
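The split counts above can be cross-checked with a few lines of arithmetic. This sketch (not from the paper's code) sums the quoted per-split annotation counts and confirms they reach the stated total of 131,641 samples.

```python
# Sanity check (illustrative, not the authors' code): the per-split
# annotation counts quoted above should sum to the stated 131,641 total.
splits = {
    "ScanNet":     {"train": 30_135, "val": 4_470},
    "ScanNet++":   {"train": 91_846, "val": 3_774},
    "ARKitScenes": {"train": 991,    "val": 425},  # train rows cover pretraining scenes
}

total = sum(n for counts in splits.values() for n in counts.values())
print(total)  # 131641
```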
Hardware Specification Yes With this feature cache, a forward pass of our model takes 1 second for a scene with 100k feature points and utilizes 8 GB of VRAM on an A100 GPU.
Software Dependencies No The paper mentions various models and tools used (e.g., Llama-3, GPT-4o, SAM 2, Grounding DINO, AdamW) but does not provide specific version numbers for these software components or other key libraries like Python, PyTorch, or CUDA.
Experiment Setup Yes LOCATE 3D is optimized using AdamW (Loshchilov and Hutter, 2019) with parameters β1 = 0.9, β2 = 0.999, weight decay of 0.01, and a learning rate scheduler as described in Appendix C.2. We optimize the following loss function: L = λ_dice L_dice + λ_ce L_ce + λ_box L_box + λ_giou L_giou + λ_align L_align, with λ_ce = 4.0 (class weight), λ_mask = 6.0 (mask cross-entropy weight), λ_dice = 4.0 (mask dice weight), λ_box = 1.0 (bounding-box L1 weight), λ_giou = 1.0 (bounding-box GIoU weight).
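As a rough illustration of how the quoted weights combine, the sketch below forms the weighted sum L = Σ_k λ_k · L_k in plain Python. The per-term loss values and the function name are hypothetical, and λ_align is omitted because its value is not quoted in the excerpt.

```python
# Illustrative sketch of the weighted loss combination; the individual loss
# values below are made up, and lambda_align is omitted because its value
# is not given in the paper excerpt above.
LOSS_WEIGHTS = {
    "ce":   4.0,  # class cross-entropy weight
    "mask": 6.0,  # mask cross-entropy weight
    "dice": 4.0,  # mask dice weight
    "box":  1.0,  # bounding-box L1 weight
    "giou": 1.0,  # bounding-box GIoU weight
}

def combined_loss(losses: dict) -> float:
    """Return the weighted sum L = sum_k lambda_k * L_k."""
    return sum(LOSS_WEIGHTS[name] * value for name, value in losses.items())

# Hypothetical per-term values, e.g. from one training step:
example = {"ce": 0.5, "mask": 0.2, "dice": 0.3, "box": 0.1, "giou": 0.05}
print(round(combined_loss(example), 4))  # 4.55
```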