Hypo3D: Exploring Hypothetical Reasoning in 3D

Authors: Ye Mao, Weixun Luo, Junpeng Jing, Anlan Qiu, Krystian Mikolajczyk

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that state-of-the-art foundation models struggle to reason effectively in hypothetically changed scenes. This reveals a substantial performance gap compared to humans, particularly in scenarios involving movement changes and directional reasoning. Even when the change is irrelevant to the question, models often incorrectly adjust their answers. The code and dataset are publicly available at: https://matchlab-imperial.github.io/Hypo3D
Researcher Affiliation | Academia | Department of Electrical and Electronic Engineering, Imperial College London, United Kingdom. Correspondence to: Junpeng Jing <EMAIL>.
Pseudocode | No | The paper describes a dataset generation pipeline using a figure (Figure 3) and textual descriptions, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The code and dataset are publicly available at: https://matchlab-imperial.github.io/Hypo3D
Open Datasets | Yes | The code and dataset are publicly available at: https://matchlab-imperial.github.io/Hypo3D. The Hypo3D benchmark comprises 700 unique scenes, with 500 sourced from the ScanNet (Dai et al., 2017) dataset and 200 from the 3RScan (Wald et al., 2019) dataset, randomly sampled from their respective sources.
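The stated composition (500 ScanNet scenes plus 200 3RScan scenes, randomly sampled) could be reproduced along these lines. This is a minimal sketch: the scene-ID formats, list sizes, and seed below are illustrative assumptions, not taken from the paper.

```python
import random

def sample_benchmark_scenes(scannet_ids, rscan_ids,
                            n_scannet=500, n_rscan=200, seed=0):
    """Randomly sample scene IDs from each source dataset, matching
    the reported Hypo3D composition (500 ScanNet + 200 3RScan = 700)."""
    rng = random.Random(seed)
    return rng.sample(scannet_ids, n_scannet) + rng.sample(rscan_ids, n_rscan)

# Illustrative placeholder IDs; the real ScanNet/3RScan scene lists differ.
scannet = [f"scene{i:04d}_00" for i in range(1513)]
rscan = [f"3rscan_{i:04d}" for i in range(1482)]
scenes = sample_benchmark_scenes(scannet, rscan)
print(len(scenes))  # 700
```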
Dataset Splits | No | The paper describes the Hypo3D benchmark as comprising 7,727 context changes and 14,885 question-answer pairs for evaluation. It mentions sampling subsets for specific analyses (e.g., 'sampled 50 scenes and 250 context changes with 50 questions per change type for assessment') but does not provide standard training/validation/test splits for model training.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. It lists the foundation models evaluated but not the hardware they were run on.
Software Dependencies | No | The paper mentions using SBERT (Reimers, 2019), GPT-4o, and GPT-4 Turbo as part of its methodology for data generation and filtering. However, it does not provide specific version numbers for SBERT or for any other programming languages, libraries, or frameworks used to implement the core methodology.
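The paper describes SBERT-based filtering only at a high level. A minimal sketch of similarity-based near-duplicate filtering over precomputed sentence embeddings is shown below; in practice the embeddings would come from an SBERT model (e.g. via the `sentence-transformers` library), and the similarity threshold here is an assumption, not a value from the paper.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def filter_near_duplicates(embeddings, threshold=0.9):
    """Greedily keep indices whose embedding is not too similar
    to any already-kept item (near-duplicate filtering)."""
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Toy 2-D "embeddings"; real SBERT vectors are e.g. 384-dimensional.
embs = [(1.0, 0.0), (0.99, 0.05), (0.0, 1.0)]
print(filter_near_duplicates(embs))  # [0, 2]
```

The second vector is nearly parallel to the first, so it is dropped as a near-duplicate, while the orthogonal third vector is kept.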
Experiment Setup | Yes | Our experiments primarily used the default inference hyperparameters for zero-shot models, as detailed in Table 8. For Claude 3.5 Sonnet, the maximum new token parameter was set to 40, reduced from its default value due to the model's tendency to generate lengthy responses, even when instructed to be concise.
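Per-model inference settings like these might be organized as a simple configuration lookup. This is a hedged sketch: only the Claude 3.5 Sonnet max-token value of 40 comes from the paper; the other entries, the parameter names, and the fallback behavior are illustrative assumptions.

```python
# Per-model inference hyperparameters as a plain config dict.
# Only Claude 3.5 Sonnet's max-token cap of 40 is stated in the paper
# (Table 8 holds the defaults); other entries are placeholders.
INFERENCE_PARAMS = {
    "claude-3-5-sonnet": {"max_tokens": 40, "temperature": 0.0},
    "gpt-4o": {"max_tokens": None, "temperature": 0.0},  # None = model default
}

def params_for(model_name):
    """Return inference kwargs for a model, dropping unset (None) values
    so the API falls back to its own defaults."""
    cfg = INFERENCE_PARAMS.get(model_name, {})
    return {k: v for k, v in cfg.items() if v is not None}

print(params_for("claude-3-5-sonnet"))  # {'max_tokens': 40, 'temperature': 0.0}
```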