3D Question Answering via only 2D Vision-Language Models

Authors: Fengyun Wang, Sicheng Yu, Jiawei Wu, Jinhui Tang, Hanwang Zhang, Qianru Sun

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate cdViews on the widely-used ScanQA and SQA benchmarks, demonstrating that it achieves state-of-the-art performance in 3D-QA while relying solely on 2D models without fine-tuning. These findings support our belief that 2D LVLMs are currently the most effective alternative (to the resource-intensive 3D LVLMs) for addressing 3D tasks.
Researcher Affiliation | Academia | 1Nanyang Technological University, Singapore; 2Singapore Management University, Singapore; 3National University of Singapore, Singapore; 4Nanjing University of Science & Technology, Nanjing, China. Correspondence to: Qianru Sun <EMAIL>.
Pseudocode | No | The paper describes the cdViews framework, viewSelector, viewNMS, and viewAnnotator in detailed prose, but does not present any of these as structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available at https://github.com/fereenwong/cdViews.
Open Datasets | Yes | We evaluate cdViews on the widely-used ScanQA and SQA benchmarks, demonstrating that it achieves state-of-the-art performance in 3D-QA while relying solely on 2D models without fine-tuning. These findings support our belief that 2D LVLMs are currently the most effective alternative (to the resource-intensive 3D LVLMs) for addressing 3D tasks.
Dataset Splits | Yes | ScanQA contains over 41K question-answer annotations across 800 indoor 3D scenes, which are divided into train, val, and test sets (with and without objects).
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments.
Software Dependencies | No | We utilize a recent state-of-the-art LVLM, i.e., LLaVA-OV-7B (Li et al., 2024a), as the 2D LVLM for all experiments, including viewAnnotator and 3D-QA. The model remains frozen throughout all experiments. Analysis on more LVLM backbones is shown in Appendix C.
Experiment Setup | Yes | Training of the viewSelector is conducted with a learning rate of 5×10^-5 and a batch size of 8. Each training iteration samples 5 positive and 5 negative views per instance generated by viewAnnotator. Here the number of views, e.g., k=9 for cdViews, is selected on the validation set (Figure 4).
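The reported setup can be summarized in a minimal sketch. This is a hypothetical illustration, not the authors' released code: the `CONFIG` dict and `sample_views` helper are assumed names, and only the numeric values (learning rate, batch size, 5 positive / 5 negative views per instance, k=9) come from the paper.

```python
import random

# Hypothetical sketch of the viewSelector training configuration reported
# in the paper; names here are illustrative, only the values are sourced.
CONFIG = {
    "learning_rate": 5e-5,  # reported viewSelector learning rate
    "batch_size": 8,        # reported batch size
    "pos_views": 5,         # positive views sampled per instance
    "neg_views": 5,         # negative views sampled per instance
    "k_views": 9,           # k for cdViews, selected on the validation set
}

def sample_views(annotated):
    """Sample 5 positive and 5 negative viewAnnotator-labeled views
    per instance, as each training iteration does in the paper."""
    pos = [v for v in annotated if v["label"] == 1]
    neg = [v for v in annotated if v["label"] == 0]
    return (random.sample(pos, min(CONFIG["pos_views"], len(pos)))
            + random.sample(neg, min(CONFIG["neg_views"], len(neg))))
```

Per the quoted setup, the frozen 2D LVLM consumes the k=9 selected views at inference time; only the lightweight viewSelector is trained with these hyperparameters.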