Language-Image Models with 3D Understanding
Authors: Jang Hyun Cho, Boris Ivanovic, Yulong Cao, Edward Schmerling, Yue Wang, Xinshuo Weng, Boyi Li, Yurong You, Philipp Krähenbühl, Yan Wang, Marco Pavone
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on outdoor benchmarks demonstrate that CUBE-LLM significantly outperforms existing baselines by 21.3 points of AP<sub>BEV</sub> on the Talk2Car dataset for 3D grounded reasoning and 17.7 points on the DriveLM dataset for complex reasoning about driving scenarios, respectively. |
| Researcher Affiliation | Collaboration | Jang Hyun Cho¹, Boris Ivanovic², Yulong Cao², Edward Schmerling², Yue Wang², Xinshuo Weng², Boyi Li², Yurong You², Philipp Krähenbühl¹, Yan Wang², Marco Pavone²; ¹University of Texas at Austin, ²NVIDIA |
| Pseudocode | No | The paper describes methods and processes in narrative text and diagrams (Figures 1, 3, 4, 6, 8-21), but does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing the source code for CUBE-LLM or a link to a code repository. It references existing open-source architectures (LLaVA-1.5, DINOv2) but not its own implementation. |
| Open Datasets | Yes | We pre-train CUBE-LLM on LV3D, a large collection of 2D and 3D datasets (Table 1). We format the existing labels into multi-turn instruction-following tasks under a standard format, as described in Sections 3.1, 3.2, and 3.3. We describe details of dataset construction in Section C of the appendix. We evaluate our model on diverse tasks, including the following 3D grounding datasets. Talk2Car (Deruyttere et al., 2019)... DriveLM (Sima et al., 2023)... |
| Dataset Splits | Yes | Talk2Car (Deruyttere et al., 2019) is a 3D referring expression comprehension dataset... It consists of 8,349 training samples and 1,163 validation samples... We process DriveLM and construct a 3D grounding dataset... We sample 600 scenes for training and 96 scenes for validation... We hold out scene IDs: '64a3a2d22172406c848f2a92275808ba', '08be42eb2186411d8e2201225329f1c6', '4b5bf3f4668d44fea9a676e9c4a8a79e', '0e247ba64b9d4a34a7256b6c173b1b5d', 'dbd9183e1278475ea54761297e004b04', '4098aaf3c7074e7d87285e2fc95369e0', '9f3c8453d03d4df5946444757376b826', '2fc3753772e241f2ab2cd16a784cc680', 'd0880a386b6d434bb5cd13c134af7a3e', '01c3f5e39956402da3e37845632fadca' in our split evaluation. |
| Hardware Specification | Yes | During pretraining, we use 8 × 8 A100s with a batch size of 1024 and train the model with a learning rate lr = 2 × 10⁻⁵ on images with 336 × 336 resolution. Then, we fine-tune all parameters including the visual encoder on a higher resolution 672 × 672 with 8 × 8 A100s and a batch size of 256 with 4 gradient accumulation steps (effective batch size of 1024) and a learning rate lr = 2 × 10⁻⁵. |
| Software Dependencies | Yes | We use LLaVA-1.5 Liu et al. (2023a) with Vicuna-7B as our base model. We replace the CLIP visual encoder with ViT-L/14 Dosovitskiy et al. (2021) based DINOv2. |
| Experiment Setup | Yes | During pretraining, we use 8 × 8 A100s with a batch size of 1024 and train the model with a learning rate lr = 2 × 10⁻⁵ on images with 336 × 336 resolution. Then, we fine-tune all parameters including the visual encoder on a higher resolution 672 × 672 with 8 × 8 A100s and a batch size of 256 with 4 gradient accumulation steps (effective batch size of 1024) and a learning rate lr = 2 × 10⁻⁵. Figure 8 additionally lists a context length of 2048 with image size 336 × 336, and a context length of 4096 with image size 672 × 672. |
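The reported training configuration and the DriveLM hold-out split can be captured in a minimal sketch. All names below (`PRETRAIN`, `FINETUNE`, `effective_batch_size`, `is_validation_scene`) are illustrative assumptions, not from the authors' code, which is not released; the values themselves come from the quotes above.

```python
# Minimal sketch of the reported CUBE-LLM training setup.
# Names are hypothetical; values are taken from the paper's quoted text.

PRETRAIN = {
    "gpus": 8 * 8,             # "8 x 8 A100s"
    "batch_size": 1024,
    "learning_rate": 2e-5,
    "image_size": (336, 336),
    "context_length": 2048,    # per Figure 8
}

FINETUNE = {
    "gpus": 8 * 8,
    "batch_size": 256,
    "grad_accum_steps": 4,     # 256 * 4 = 1024 effective
    "learning_rate": 2e-5,
    "image_size": (672, 672),
    "context_length": 4096,    # per Figure 8
}

def effective_batch_size(cfg: dict) -> int:
    """Per-step batch size times gradient-accumulation steps (default 1)."""
    return cfg["batch_size"] * cfg.get("grad_accum_steps", 1)

# The ten scene IDs the paper holds out for its DriveLM split evaluation.
HELD_OUT_SCENES = {
    "64a3a2d22172406c848f2a92275808ba",
    "08be42eb2186411d8e2201225329f1c6",
    "4b5bf3f4668d44fea9a676e9c4a8a79e",
    "0e247ba64b9d4a34a7256b6c173b1b5d",
    "dbd9183e1278475ea54761297e004b04",
    "4098aaf3c7074e7d87285e2fc95369e0",
    "9f3c8453d03d4df5946444757376b826",
    "2fc3753772e241f2ab2cd16a784cc680",
    "d0880a386b6d434bb5cd13c134af7a3e",
    "01c3f5e39956402da3e37845632fadca",
}

def is_validation_scene(scene_id: str) -> bool:
    """Route a scene to the held-out validation split."""
    return scene_id in HELD_OUT_SCENES

print(effective_batch_size(FINETUNE))  # 1024
```

This makes the batch-size arithmetic explicit: the fine-tuning stage reaches the same effective batch size as pretraining (1024) by accumulating gradients over 4 steps of 256.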