MuMA-ToM: Multi-modal Multi-Agent Theory of Mind
Authors: Haojun Shi, Suyu Ye, Xinyu Fang, Chuanyang Jin, Leyla Isik, Yen-Ling Kuo, Tianmin Shu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validated MuMA-ToM in a human experiment and provided a human baseline. We also proposed a novel multi-modal, multi-agent ToM model, LIMP (Language model-based Inverse Multi-agent Planning). Our experimental results show that LIMP significantly outperforms state-of-the-art methods, including large multi-modal models (e.g., GPT-4o, Gemini 1.5 Pro) and a recent multi-modal ToM model, BIP-ALM. Experiments Human Experiment We recruited 18 participants (mean age = 36.0; 10 female) from Prolific to answer 90 questions randomly sampled from the benchmark. Each question received responses from 3 participants. The experiment was approved by an institutional review board. Baselines We evaluated our benchmark on state-of-the-art LMMs. [...] Results We report the human and model performance in Figure 5 and Table 1. |
| Researcher Affiliation | Academia | 1Johns Hopkins University, 2University of Virginia EMAIL, EMAIL |
| Pseudocode | No | The paper describes the LIMP model architecture and components conceptually and through diagrams, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or sections with structured, code-like steps. |
| Open Source Code | Yes | Code and data https://scai.cs.jhu.edu/projects/MuMA-ToM/ |
| Open Datasets | Yes | Code and data https://scai.cs.jhu.edu/projects/MuMA-ToM/ [...] We introduce a new Theory of Mind benchmark, MuMA-ToM (Multi-modal Multi-Agent Theory of Mind benchmark). MuMA-ToM includes a large set of question-answering trials. [...] We also created a training set consisting of 1,030 videos annotated with the agents' actions and goals. |
| Dataset Splits | No | The paper mentions a training set of 1,030 videos and that the benchmark consists of 225 social interactions and 900 questions. It also states that 90 questions were randomly sampled for the human experiment. However, it does not provide specific train/validation/test splits (percentages, counts, or explicit methodology) for the 900 benchmark questions or for the model evaluations described. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models (e.g., NVIDIA A100, RTX 2080 Ti), CPU models, or detailed cloud computing specifications used for running the experiments or training the models. |
| Software Dependencies | Yes | We evaluated GPT-4o (OpenAI 2023), LLaVA 1.6 (Liu et al. 2023), Gemini 1.5 (Reid et al. 2024), InternVL2 (Chen et al. 2023) and VideoLLaMA 2 (Cheng et al. 2024). We evaluated the latest version of each LMM at the time of submission. For LIMP, we use Gemini 1.5 Pro as the VLM and GPT-4o as the LLM. |
| Experiment Setup | No | The paper does not provide specific experimental setup details such as hyperparameter values (e.g., learning rate, batch size, number of epochs) or other training configurations for the LIMP model or the evaluated baselines. While finetuning is mentioned for VideoLLaMA 2, no specifics of the finetuning process are provided. |