Language-Image Models with 3D Understanding

Authors: Jang Hyun Cho, Boris Ivanovic, Yulong Cao, Edward Schmerling, Yue Wang, Xinshuo Weng, Boyi Li, Yurong You, Philipp Krähenbühl, Yan Wang, Marco Pavone

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments on outdoor benchmarks demonstrate that CUBE-LLM significantly outperforms existing baselines by 21.3 points of AP_BEV on the Talk2Car dataset for 3D grounded reasoning and 17.7 points on the DriveLM dataset for complex reasoning about driving scenarios, respectively.
Researcher Affiliation | Collaboration | Jang Hyun Cho (1), Boris Ivanovic (2), Yulong Cao (2), Edward Schmerling (2), Yue Wang (2), Xinshuo Weng (2), Boyi Li (2), Yurong You (2), Philipp Krähenbühl (1), Yan Wang (2), Marco Pavone (2); (1) University of Texas at Austin, (2) NVIDIA
Pseudocode | No | The paper describes methods and processes in narrative text and diagrams (Figures 1, 3, 4, 6, 8-21), but does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain an explicit statement about releasing the source code for CUBE-LLM or a link to a code repository. It references existing open-source architectures (LLaVA-1.5, DINOv2) but not its own implementation.
Open Datasets | Yes | We pre-train CUBE-LLM on LV3D, a large collection of 2D and 3D datasets (Table 1). We format the existing labels into multi-turn instruction-following tasks under a standard format, as described in Sections 3.1, 3.2, and 3.3. We describe details of dataset construction in Section C of the appendix. We evaluate our model on diverse tasks, including the following 3D grounding datasets: Talk2Car (Deruyttere et al., 2019)... DriveLM (Sima et al., 2023)...
Dataset Splits | Yes | Talk2Car (Deruyttere et al., 2019) is a 3D referring expression comprehension dataset... It consists of 8,349 training samples and 1,163 validation samples... We process DriveLM and construct a 3D grounding dataset... We sample 600 scenes for training and 96 scenes for validation... We hold out scene IDs: '64a3a2d22172406c848f2a92275808ba', '08be42eb2186411d8e2201225329f1c6', '4b5bf3f4668d44fea9a676e9c4a8a79e', '0e247ba64b9d4a34a7256b6c173b1b5d', 'dbd9183e1278475ea54761297e004b04', '4098aaf3c7074e7d87285e2fc95369e0', '9f3c8453d03d4df5946444757376b826', '2fc3753772e241f2ab2cd16a784cc680', 'd0880a386b6d434bb5cd13c134af7a3e', '01c3f5e39956402da3e37845632fadca' in our split evaluation.
Hardware Specification | Yes | During pretraining, we use 8 × 8 A100s with a batch size of 1024 and train the model with a learning rate lr = 2 × 10^-5 on images with 336 × 336 resolution. Then, we fine-tune all parameters including the visual encoder on a higher resolution 672 × 672 with 8 × 8 A100s and a batch size of 256 with 4 gradient accumulation steps (effective batch size of 1024) and a learning rate lr = 2 × 10^-5.
Software Dependencies | Yes | We use LLaVA-1.5 (Liu et al., 2023a) with Vicuna-7B as our base model. We replace the CLIP visual encoder with ViT-L/14 (Dosovitskiy et al., 2021) based DINOv2.
Experiment Setup | Yes | During pretraining, we use 8 × 8 A100s with a batch size of 1024 and train the model with a learning rate lr = 2 × 10^-5 on images with 336 × 336 resolution. Then, we fine-tune all parameters including the visual encoder on a higher resolution 672 × 672 with 8 × 8 A100s and a batch size of 256 with 4 gradient accumulation steps (effective batch size of 1024) and a learning rate lr = 2 × 10^-5. Figure 8: context length 2048 with image size 336 × 336 (pretraining), and context length 4096 with image size 672 × 672 (fine-tuning).
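The Hardware Specification and Experiment Setup rows both report a fine-tuning batch size of 256 with 4 gradient-accumulation steps, yielding an effective batch size of 1024. That arithmetic can be checked with a minimal sketch; the function name below is illustrative, not from the authors' code:

```python
# Sketch of the effective-batch-size arithmetic quoted in the table above.
# "8 x 8 A100s" denotes 8 nodes of 8 GPUs each (64 GPUs total); the quoted
# batch sizes are global, so only gradient accumulation scales them here.

def effective_batch_size(global_batch: int, grad_accum_steps: int) -> int:
    """Effective batch = global per-step batch x gradient-accumulation steps."""
    return global_batch * grad_accum_steps

# Pretraining: global batch 1024, no gradient accumulation reported.
pretrain = effective_batch_size(1024, 1)

# Fine-tuning: global batch 256 with 4 accumulation steps, matching the
# pretraining effective batch of 1024 at the higher 672 x 672 resolution.
finetune = effective_batch_size(256, 4)

print(pretrain, finetune)  # -> 1024 1024
```

This confirms that the paper's fine-tuning configuration keeps the effective batch size constant while trading per-step memory for accumulation at the larger input resolution.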
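The Dataset Splits row defines the DriveLM validation split by holding out specific nuScenes scene IDs. A split of that shape can be reproduced by filtering samples on their scene ID; the helper name and record layout below are assumptions for illustration, not the authors' code:

```python
# Hypothetical sketch of a held-out split keyed on scene IDs, as in the
# DriveLM split quoted above. Two of the paper's held-out IDs are shown;
# the rest would be added the same way.
HELD_OUT_SCENES = {
    "64a3a2d22172406c848f2a92275808ba",
    "08be42eb2186411d8e2201225329f1c6",
}

def split_by_scene(samples):
    """Partition samples into (train, val) by membership in HELD_OUT_SCENES."""
    train = [s for s in samples if s["scene_id"] not in HELD_OUT_SCENES]
    val = [s for s in samples if s["scene_id"] in HELD_OUT_SCENES]
    return train, val

# Toy usage with an assumed record layout:
samples = [
    {"scene_id": "64a3a2d22172406c848f2a92275808ba", "q": "Where is the truck?"},
    {"scene_id": "0000000000000000000000000000aaaa", "q": "Is the light red?"},
]
train, val = split_by_scene(samples)
print(len(train), len(val))  # -> 1 1
```

Splitting at the scene level rather than the sample level avoids leaking frames from a held-out driving scene into training, which matters for the split's validity as an evaluation set.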