Overcoming Multi-step Complexity in Multimodal Theory-of-Mind Reasoning: A Scalable Bayesian Planner

Authors: Chunhui Zhang, Zhongyu Ouyang, Kwonjoon Lee, Nakul Agarwal, Sean Dae Houlihan, Soroush Vosoughi, Shao-Yuan Lo

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments demonstrate a 4.6% improvement in accuracy over state-of-the-art methods on multimodal ToM benchmarks, including unseen scenarios, establishing a new standard for modeling human mental states in complex environments.
Researcher Affiliation Collaboration 1Dartmouth College, Hanover, NH, USA 2Honda Research Institute USA, San Jose, CA, USA.
Pseudocode No The paper describes the method using mathematical equations and prose (e.g., Section 3, equations 1-7), but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code Yes Our methods are designed to be fully reproducible, with detailed descriptions of datasets, experimental settings, and methodologies provided in this paper and our repository: https://github.com/chunhuizng/scale-bayesian-planner.
Open Datasets Yes Datasets (i) For post-training, we use MMToM sampled from an apartment environment simulator, VirtualHome (Puig et al., 2018), using the procedural methods described by Jin et al. (2024). The dataset comprises 1,000 procedurally synthesized videos within a realistic household simulator, each annotated with states, goals, beliefs, and actions. (ii) For evaluation, we use MMToM-QA (Jin et al., 2024), a benchmark for evaluating ToM reasoning over multimodal situations.
Dataset Splits Yes The training pool size N for post-training was set to 20,000 data points, sourced from the MMToM dataset's training split and our released data sampled from an embodied simulator. For tasks involving transfer to new themes, the training dataset size remained consistent at 20,000 data points, ensuring a fair and uniform setup across different experiments.
Hardware Specification Yes The fine-tuning process for smaller models (e.g., Llama3.1-8B) was conducted using a single NVIDIA H100 GPU, leveraging BF16 mode to optimize memory usage and maintain GPU memory consumption under 60GB.
Software Dependencies No The paper mentions using Llama models (Llama2, Llama3, Llama3.1) and LoRA, but it does not specify explicit version numbers for the software libraries or frameworks used, only hyperparameters for LoRA.
Experiment Setup Yes Following the setup recommended by Jin et al. (2024), we use a learning rate of 1e-3 over 3 epochs. LoRA is configured with a rank of 16 and an alpha value of 32 for the 7B and 8B LMs. For 70B, we use a lower rank of 8 and an alpha of 16. ... Batch size: 16 (achieved via a per-device batch size of 4 and gradient accumulation steps of 4), Learning rate: 5e-5, Number of epochs: 3.
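The paper does not publish its training script, but the quoted hyperparameters can be collected into a configuration sketch. The dictionary keys below mirror common Hugging Face PEFT/Trainer argument names as an assumption; only the numeric values come from the paper's reported setup.

```python
# Hedged sketch of the reported fine-tuning hyperparameters.
# Key names follow Hugging Face PEFT/Trainer conventions (an assumption);
# the values (rank, alpha, batch size, learning rate, epochs, BF16) are
# taken from the Experiment Setup and Hardware rows above.

LORA_BY_MODEL_SIZE = {
    "7B":  {"r": 16, "lora_alpha": 32},  # rank 16, alpha 32 for 7B/8B LMs
    "8B":  {"r": 16, "lora_alpha": 32},
    "70B": {"r": 8,  "lora_alpha": 16},  # lower rank/alpha for the 70B model
}

TRAINING = {
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 4,
    "learning_rate": 5e-5,
    "num_train_epochs": 3,
    "bf16": True,  # BF16 mode, keeping usage under 60 GB on a single H100
}

def effective_batch_size(cfg: dict) -> int:
    """Effective batch = per-device batch size x gradient accumulation steps."""
    return cfg["per_device_train_batch_size"] * cfg["gradient_accumulation_steps"]

print(effective_batch_size(TRAINING))  # 16, matching the reported batch size
```

The effective batch size of 16 is reached through gradient accumulation rather than a larger per-device batch, consistent with the single-GPU memory budget described in the Hardware row.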