Language-Image Models with 3D Understanding

Authors: Jang Hyun Cho, Boris Ivanovic, Yulong Cao, Edward Schmerling, Yue Wang, Xinshuo Weng, Boyi Li, Yurong You, Philipp Krähenbühl, Yan Wang, Marco Pavone

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments on outdoor benchmarks demonstrate that CUBE-LLM significantly outperforms existing baselines by 21.3 points of AP_BEV on the Talk2Car dataset for 3D grounded reasoning and 17.7 points on the DriveLM dataset for complex reasoning about driving scenarios, respectively.
Researcher Affiliation | Collaboration | Jang Hyun Cho (1), Boris Ivanovic (2), Yulong Cao (2), Edward Schmerling (2), Yue Wang (2), Xinshuo Weng (2), Boyi Li (2), Yurong You (2), Philipp Krähenbühl (1), Yan Wang (2), Marco Pavone (2); (1) University of Texas at Austin, (2) NVIDIA
Pseudocode | No | The paper describes methods and processes in narrative text and diagrams (Figures 1, 3, 4, 6, 8-21), but does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain an explicit statement about releasing the source code for CUBE-LLM or a link to a code repository. It references existing open-source architectures (LLaVA-1.5, DINOv2) but not its own implementation.
Open Datasets | Yes | We pre-train CUBE-LLM on LV3D, a large collection of 2D and 3D datasets (Table 1). We format the existing labels into multi-turn instruction-following tasks under a standard format, as described in Sections 3.1, 3.2, and 3.3. We describe details of dataset construction in Section C of the appendix. We evaluate our model on diverse tasks, including the following 3D grounding datasets: Talk2Car (Deruyttere et al., 2019)... DriveLM (Sima et al., 2023)...
Dataset Splits | Yes | Talk2Car (Deruyttere et al., 2019) is a 3D referring expression comprehension dataset... It consists of 8,349 training samples and 1,163 validation samples... We process DriveLM and construct a 3D grounding dataset... We sample 600 scenes for training and 96 scenes for validation... We hold out scene IDs: '64a3a2d22172406c848f2a92275808ba', '08be42eb2186411d8e2201225329f1c6', '4b5bf3f4668d44fea9a676e9c4a8a79e', '0e247ba64b9d4a34a7256b6c173b1b5d', 'dbd9183e1278475ea54761297e004b04', '4098aaf3c7074e7d87285e2fc95369e0', '9f3c8453d03d4df5946444757376b826', '2fc3753772e241f2ab2cd16a784cc680', 'd0880a386b6d434bb5cd13c134af7a3e', '01c3f5e39956402da3e37845632fadca' in our split evaluation.
Hardware Specification | Yes | During pretraining, we use 8 × 8 A100s with a batch size of 1024 and train the model with a learning rate lr = 2 × 10^-5 on images with 336 × 336 resolution. Then, we fine-tune all parameters including the visual encoder on a higher resolution 672 × 672 with 8 × 8 A100s and a batch size of 256 with 4 gradient accumulation steps (effective batch size of 1024) and a learning rate lr = 2 × 10^-5.
Software Dependencies | Yes | We use LLaVA-1.5 (Liu et al., 2023a) with Vicuna-7B as our base model. We replace the CLIP visual encoder with ViT-L/14 (Dosovitskiy et al., 2021) based DINOv2.
Experiment Setup | Yes | During pretraining, we use 8 × 8 A100s with a batch size of 1024 and train the model with a learning rate lr = 2 × 10^-5 on images with 336 × 336 resolution. Then, we fine-tune all parameters including the visual encoder on a higher resolution 672 × 672 with 8 × 8 A100s and a batch size of 256 with 4 gradient accumulation steps (effective batch size of 1024) and a learning rate lr = 2 × 10^-5. Figure 8: context length 2048 with image size 336 × 336 (pretraining), and context length 4096 with image size 672 × 672 (fine-tuning).
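The Hardware Specification and Experiment Setup rows both report a fine-tuning batch size of 256 with 4 gradient-accumulation steps, yielding an effective batch size of 1024. That arithmetic can be checked with a minimal sketch; the function name below is illustrative, not from the authors' code:

```python
# Sketch of the effective-batch-size arithmetic quoted in the table above.
# "8 x 8 A100s" denotes 8 nodes of 8 GPUs each (64 GPUs total); the quoted
# batch sizes are global, so only gradient accumulation scales them here.

def effective_batch_size(global_batch: int, grad_accum_steps: int) -> int:
    """Effective batch = global per-step batch x gradient-accumulation steps."""
    return global_batch * grad_accum_steps

# Pretraining: global batch 1024, no gradient accumulation reported.
pretrain = effective_batch_size(1024, 1)

# Fine-tuning: global batch 256 with 4 accumulation steps, matching the
# pretraining effective batch of 1024 at the higher 672 x 672 resolution.
finetune = effective_batch_size(256, 4)

print(pretrain, finetune)  # -> 1024 1024
```

This confirms that the paper's fine-tuning configuration keeps the effective batch size constant while trading per-step memory for accumulation at the larger input resolution.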
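The Dataset Splits row defines the DriveLM validation split by holding out specific nuScenes scene IDs. A split of that shape can be reproduced by filtering samples on their scene ID; the helper name and record layout below are assumptions for illustration, not the authors' code:

```python
# Hypothetical sketch of a held-out split keyed on scene IDs, as in the
# DriveLM split quoted above. Two of the paper's held-out IDs are shown;
# the rest would be added the same way.
HELD_OUT_SCENES = {
    "64a3a2d22172406c848f2a92275808ba",
    "08be42eb2186411d8e2201225329f1c6",
}

def split_by_scene(samples):
    """Partition samples into (train, val) by membership in HELD_OUT_SCENES."""
    train = [s for s in samples if s["scene_id"] not in HELD_OUT_SCENES]
    val = [s for s in samples if s["scene_id"] in HELD_OUT_SCENES]
    return train, val

# Toy usage with an assumed record layout:
samples = [
    {"scene_id": "64a3a2d22172406c848f2a92275808ba", "q": "Where is the truck?"},
    {"scene_id": "0000000000000000000000000000aaaa", "q": "Is the light red?"},
]
train, val = split_by_scene(samples)
print(len(train), len(val))  # -> 1 1
```

Splitting at the scene level rather than the sample level avoids leaking frames from a held-out driving scene into training, which matters for the split's validity as an evaluation set.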