3D Question Answering via only 2D Vision-Language Models
Authors: Fengyun Wang, Sicheng Yu, Jiawei Wu, Jinhui Tang, Hanwang Zhang, Qianru Sun
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate cdViews on the widely-used ScanQA and SQA benchmarks, demonstrating that it achieves state-of-the-art performance in 3D-QA while relying solely on 2D models without fine-tuning. These findings support our belief that 2D LVLMs are currently the most effective alternative (to the resource-intensive 3D LVLMs) for addressing 3D tasks. |
| Researcher Affiliation | Academia | 1Nanyang Technological University, Singapore 2Singapore Management University, Singapore 3National University of Singapore, Singapore 4Nanjing University of Science & Technology, Nanjing, China. Correspondence to: Qianru Sun <EMAIL>. |
| Pseudocode | No | The paper describes the cdViews framework, viewSelector, viewNMS, and viewAnnotator in detailed prose, but does not present any of these as structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/fereenwong/cdViews. |
| Open Datasets | Yes | We evaluate cdViews on the widely-used ScanQA and SQA benchmarks, demonstrating that it achieves state-of-the-art performance in 3D-QA while relying solely on 2D models without fine-tuning. These findings support our belief that 2D LVLMs are currently the most effective alternative (to the resource-intensive 3D LVLMs) for addressing 3D tasks. |
| Dataset Splits | Yes | ScanQA contains over 41K question-answer annotations across 800 indoor 3D scenes, which are divided into train, val, and test sets (with or without objects). |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | We utilize a recent state-of-the-art LVLM, i.e., LLaVA-OV-7B (Li et al., 2024a), as the 2D LVLM for all experiments, including viewAnnotator and 3D-QA. The model remains frozen throughout all experiments. Analysis on more LVLM backbones is shown in Appendix C. |
| Experiment Setup | Yes | Training of the viewSelector is conducted with a learning rate of 5×10^-5 and a batch size of 8. Each training iteration samples 5 positive and 5 negative views per instance generated by viewAnnotator. Here the number of views, e.g., k=9 for cdViews, is selected on the validation set (Figure 4). |
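The quoted experiment setup pins down only the hyperparameters (learning rate 5×10^-5, batch size 8, 5 positive plus 5 negative views per instance from the viewAnnotator). A minimal sketch of that per-instance sampling step is below; the instance data layout, binary labels, and function name `sample_views` are illustrative assumptions, not the authors' actual implementation.

```python
import random

# Hyperparameters quoted in the paper's setup description.
LR = 5e-5
BATCH_SIZE = 8
POS_PER_INSTANCE = 5
NEG_PER_INSTANCE = 5

def sample_views(instance):
    """Sample 5 positive and 5 negative views for one instance.

    Assumes the viewAnnotator has already split each instance's views
    into 'positive_views' and 'negative_views' lists (a hypothetical
    data layout); returns (view, label) pairs with binary labels.
    """
    pos = random.sample(instance["positive_views"], POS_PER_INSTANCE)
    neg = random.sample(instance["negative_views"], NEG_PER_INSTANCE)
    return [(v, 1.0) for v in pos] + [(v, 0.0) for v in neg]
```

Each training iteration would then batch these 10-view samples across 8 instances and update the viewSelector with the stated learning rate; the scorer architecture and loss are not specified in this report.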