3D Question Answering via only 2D Vision-Language Models

Authors: Fengyun Wang, Sicheng Yu, Jiawei Wu, Jinhui Tang, Hanwang Zhang, Qianru Sun

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate cdViews on the widely-used ScanQA and SQA benchmarks, demonstrating that it achieves state-of-the-art performance in 3D-QA while relying solely on 2D models without fine-tuning. These findings support our belief that 2D LVLMs are currently the most effective alternative (to the resource-intensive 3D LVLMs) for addressing 3D tasks.
Researcher Affiliation | Academia | 1Nanyang Technological University, Singapore; 2Singapore Management University, Singapore; 3National University of Singapore, Singapore; 4Nanjing University of Science & Technology, Nanjing, China. Correspondence to: Qianru Sun <EMAIL>.
Pseudocode | No | The paper describes the cdViews framework, viewSelector, viewNMS, and viewAnnotator in detailed prose, but does not present any of these as structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available at https://github.com/fereenwong/cdViews.
Open Datasets | Yes | We evaluate cdViews on the widely-used ScanQA and SQA benchmarks, demonstrating that it achieves state-of-the-art performance in 3D-QA while relying solely on 2D models without fine-tuning. These findings support our belief that 2D LVLMs are currently the most effective alternative (to the resource-intensive 3D LVLMs) for addressing 3D tasks.
Dataset Splits | Yes | ScanQA contains over 41K question-answer annotations across 800 indoor 3D scenes, which are divided into train, val, and test sets (with and without objects).
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments.
Software Dependencies | No | We utilize a recent state-of-the-art LVLM, i.e., LLaVA-OV-7B (Li et al., 2024a), as the 2D LVLM for all experiments, including viewAnnotator and 3D-QA. The model remains frozen throughout all experiments. Analysis on more LVLM backbones is shown in Appendix C.
Experiment Setup | Yes | Training of the viewSelector is conducted with a learning rate of 5×10^-5 and a batch size of 8. Each training iteration samples 5 positive and 5 negative views per instance generated by viewAnnotator. Here the number of views, e.g., k=9 for cdViews, is selected on the validation set (Figure 4).
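The reported setup can be summarized in a minimal sketch. This is a hypothetical illustration, not the authors' released code: the `CONFIG` dict and `sample_views` helper are assumed names, and only the numeric values (learning rate, batch size, 5 positive / 5 negative views per instance, k=9) come from the paper.

```python
import random

# Hypothetical sketch of the viewSelector training configuration reported
# in the paper; names here are illustrative, only the values are sourced.
CONFIG = {
    "learning_rate": 5e-5,  # reported viewSelector learning rate
    "batch_size": 8,        # reported batch size
    "pos_views": 5,         # positive views sampled per instance
    "neg_views": 5,         # negative views sampled per instance
    "k_views": 9,           # k for cdViews, selected on the validation set
}

def sample_views(annotated):
    """Sample 5 positive and 5 negative viewAnnotator-labeled views
    per instance, as each training iteration does in the paper."""
    pos = [v for v in annotated if v["label"] == 1]
    neg = [v for v in annotated if v["label"] == 0]
    return (random.sample(pos, min(CONFIG["pos_views"], len(pos)))
            + random.sample(neg, min(CONFIG["neg_views"], len(neg))))
```

Per the quoted setup, the frozen 2D LVLM consumes the k=9 selected views at inference time; only the lightweight viewSelector is trained with these hyperparameters.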