Hypo3D: Exploring Hypothetical Reasoning in 3D

Authors: Ye Mao, Weixun Luo, Junpeng Jing, Anlan Qiu, Krystian Mikolajczyk

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that state-of-the-art foundation models struggle to reason effectively in hypothetically changed scenes. This reveals a substantial performance gap compared to humans, particularly in scenarios involving movement changes and directional reasoning. Even when the change is irrelevant to the question, models often incorrectly adjust their answers. The code and dataset are publicly available at: https://matchlab-imperial.github.io/Hypo3D
Researcher Affiliation | Academia | Department of Electrical and Electronic Engineering, Imperial College London, United Kingdom. Correspondence to: Junpeng Jing <EMAIL>.
Pseudocode | No | The paper describes a dataset generation pipeline using a figure (Figure 3) and textual descriptions, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The code and dataset are publicly available at: https://matchlab-imperial.github.io/Hypo3D
Open Datasets | Yes | The code and dataset are publicly available at: https://matchlab-imperial.github.io/Hypo3D. The Hypo3D benchmark comprises 700 unique scenes, with 500 sourced from the ScanNet (Dai et al., 2017) dataset and 200 from the 3RScan (Wald et al., 2019) dataset, randomly sampled from their respective sources.
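The stated composition (500 ScanNet scenes plus 200 3RScan scenes, randomly sampled) could be reproduced along these lines. This is a minimal sketch: the scene-ID formats, list sizes, and seed below are illustrative assumptions, not taken from the paper.

```python
import random

def sample_benchmark_scenes(scannet_ids, rscan_ids,
                            n_scannet=500, n_rscan=200, seed=0):
    """Randomly sample scene IDs from each source dataset, matching
    the reported Hypo3D composition (500 ScanNet + 200 3RScan = 700)."""
    rng = random.Random(seed)
    return rng.sample(scannet_ids, n_scannet) + rng.sample(rscan_ids, n_rscan)

# Illustrative placeholder IDs; the real ScanNet/3RScan scene lists differ.
scannet = [f"scene{i:04d}_00" for i in range(1513)]
rscan = [f"3rscan_{i:04d}" for i in range(1482)]
scenes = sample_benchmark_scenes(scannet, rscan)
print(len(scenes))  # 700
```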
Dataset Splits | No | The paper describes the Hypo3D benchmark as comprising 7,727 context changes and 14,885 question-answer pairs for evaluation. It mentions sampling subsets for specific analyses (e.g., 'sampled 50 scenes and 250 context changes with 50 questions per change type for assessment') but does not provide standard training/validation/test splits for model training.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. It lists the foundation models evaluated but not the hardware they were run on.
Software Dependencies | No | The paper mentions using SBERT (Reimers, 2019), GPT-4o, and GPT-4 Turbo as part of its methodology for data generation and filtering. However, it does not provide specific version numbers for SBERT or for any other programming languages, libraries, or frameworks used to implement the core methodology.
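The paper describes SBERT-based filtering only at a high level. A minimal sketch of similarity-based near-duplicate filtering over precomputed sentence embeddings is shown below; in practice the embeddings would come from an SBERT model (e.g. via the `sentence-transformers` library), and the similarity threshold here is an assumption, not a value from the paper.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def filter_near_duplicates(embeddings, threshold=0.9):
    """Greedily keep indices whose embedding is not too similar
    to any already-kept item (near-duplicate filtering)."""
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Toy 2-D "embeddings"; real SBERT vectors are e.g. 384-dimensional.
embs = [(1.0, 0.0), (0.99, 0.05), (0.0, 1.0)]
print(filter_near_duplicates(embs))  # [0, 2]
```

The second vector is nearly parallel to the first, so it is dropped as a near-duplicate, while the orthogonal third vector is kept.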
Experiment Setup | Yes | Our experiments primarily used the default inference hyperparameters for zero-shot models, as detailed in Table 8. For Claude 3.5 Sonnet, the maximum new token parameter was set to 40, reduced from its default value due to the model's tendency to generate lengthy responses, even when instructed to be concise.
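Per-model inference settings like these might be organized as a simple configuration lookup. This is a hedged sketch: only the Claude 3.5 Sonnet max-token value of 40 comes from the paper; the other entries, the parameter names, and the fallback behavior are illustrative assumptions.

```python
# Per-model inference hyperparameters as a plain config dict.
# Only Claude 3.5 Sonnet's max-token cap of 40 is stated in the paper
# (Table 8 holds the defaults); other entries are placeholders.
INFERENCE_PARAMS = {
    "claude-3-5-sonnet": {"max_tokens": 40, "temperature": 0.0},
    "gpt-4o": {"max_tokens": None, "temperature": 0.0},  # None = model default
}

def params_for(model_name):
    """Return inference kwargs for a model, dropping unset (None) values
    so the API falls back to its own defaults."""
    cfg = INFERENCE_PARAMS.get(model_name, {})
    return {k: v for k, v in cfg.items() if v is not None}

print(params_for("claude-3-5-sonnet"))  # {'max_tokens': 40, 'temperature': 0.0}
```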