Overcoming Multi-step Complexity in Multimodal Theory-of-Mind Reasoning: A Scalable Bayesian Planner

Authors: Chunhui Zhang, Zhongyu Ouyang, Kwonjoon Lee, Nakul Agarwal, Sean Dae Houlihan, Soroush Vosoughi, Shao-Yuan Lo

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments demonstrate a 4.6% improvement in accuracy over state-of-the-art methods on multimodal ToM benchmarks, including unseen scenarios, establishing a new standard for modeling human mental states in complex environments.
Researcher Affiliation Collaboration 1Dartmouth College, Hanover, NH, USA 2Honda Research Institute USA, San Jose, CA, USA.
Pseudocode No The paper describes the method using mathematical equations and prose (e.g., Section 3, equations 1-7), but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code Yes Our methods are designed to be fully reproducible, with detailed descriptions of datasets, experimental settings, and methodologies provided in this paper and our repository: https://github.com/chunhuizng/scale-bayesian-planner.
Open Datasets Yes Datasets (i) For post-training, we use MMToM sampled from an apartment environment simulator, VirtualHome (Puig et al., 2018), using the procedural methods described by Jin et al. (2024). The dataset comprises 1,000 procedurally synthesized videos within a realistic household simulator, each annotated with states, goals, beliefs, and actions. (ii) For evaluation, we use MMToM-QA (Jin et al., 2024), a benchmark for evaluating ToM reasoning over multimodal situations.
Dataset Splits Yes The training pool size N for post-training was set to 20,000 data points, sourced from the MMToM dataset's training split and our released data sampled from an embodied simulator. For tasks involving transfer to new themes, the training dataset size remained consistent at 20,000 data points, ensuring a fair and uniform setup across different experiments.
Hardware Specification Yes The fine-tuning process for smaller models (e.g., Llama3.1-8B) was conducted using a single NVIDIA H100 GPU, leveraging BF16 mode to optimize memory usage and maintain GPU memory consumption under 60GB.
Software Dependencies No The paper mentions using Llama models (Llama2, Llama3, Llama3.1) and LoRA, but it does not specify explicit version numbers for the software libraries or frameworks used, only hyperparameters for LoRA.
Experiment Setup Yes Following the setup recommended by Jin et al. (2024), we use a learning rate of 1e-3 over 3 epochs. LoRA is configured with a rank of 16 and an alpha value of 32 for the 7B and 8B LMs. For 70B, we use a lower rank of 8 and an alpha of 16. ... Batch size: 16 (achieved via a per-device batch size of 4 and gradient accumulation steps of 4), Learning rate: 5e-5, Number of epochs: 3.
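The paper does not publish its training script, but the quoted hyperparameters can be collected into a configuration sketch. The dictionary keys below mirror common Hugging Face PEFT/Trainer argument names as an assumption; only the numeric values come from the paper's reported setup.

```python
# Hedged sketch of the reported fine-tuning hyperparameters.
# Key names follow Hugging Face PEFT/Trainer conventions (an assumption);
# the values (rank, alpha, batch size, learning rate, epochs, BF16) are
# taken from the Experiment Setup and Hardware rows above.

LORA_BY_MODEL_SIZE = {
    "7B":  {"r": 16, "lora_alpha": 32},  # rank 16, alpha 32 for 7B/8B LMs
    "8B":  {"r": 16, "lora_alpha": 32},
    "70B": {"r": 8,  "lora_alpha": 16},  # lower rank/alpha for the 70B model
}

TRAINING = {
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 4,
    "learning_rate": 5e-5,
    "num_train_epochs": 3,
    "bf16": True,  # BF16 mode, keeping usage under 60 GB on a single H100
}

def effective_batch_size(cfg: dict) -> int:
    """Effective batch = per-device batch size x gradient accumulation steps."""
    return cfg["per_device_train_batch_size"] * cfg["gradient_accumulation_steps"]

print(effective_batch_size(TRAINING))  # 16, matching the reported batch size
```

The effective batch size of 16 is reached through gradient accumulation rather than a larger per-device batch, consistent with the single-GPU memory budget described in the Hardware row.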