MuMA-ToM: Multi-modal Multi-Agent Theory of Mind
Authors: Haojun Shi, Suyu Ye, Xinyu Fang, Chuanyang Jin, Leyla Isik, Yen-Ling Kuo, Tianmin Shu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validated MuMA-ToM in a human experiment and provided a human baseline. We also proposed a novel multi-modal, multi-agent ToM model, LIMP (Language model-based Inverse Multi-agent Planning). Our experimental results show that LIMP significantly outperforms state-of-the-art methods, including large multi-modal models (e.g., GPT-4o, Gemini 1.5 Pro) and a recent multi-modal ToM model, BIP-ALM. Experiments Human Experiment We recruited 18 participants (mean age = 36.0; 10 female) from Prolific to answer 90 questions randomly sampled from the benchmark. Each question received responses from 3 participants. The experiment was approved by an institutional review board. Baselines We evaluated our benchmark on state-of-the-art LMMs. [...] Results We report the human and model performance in Figure 5 and Table 1. |
| Researcher Affiliation | Academia | 1Johns Hopkins University, 2University of Virginia EMAIL, EMAIL |
| Pseudocode | No | The paper describes the LIMP model architecture and components conceptually and through diagrams, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or sections with structured, code-like steps. |
| Open Source Code | Yes | Code and data https://scai.cs.jhu.edu/projects/MuMA-ToM/ |
| Open Datasets | Yes | Code and data https://scai.cs.jhu.edu/projects/MuMA-ToM/ [...] We introduce a new Theory of Mind benchmark, MuMA-ToM (Multi-modal Multi-Agent Theory of Mind benchmark). MuMA-ToM includes a large set of question-answering trials. [...] We also created a training set consisting of 1,030 videos annotated with the agents' actions and goals. |
| Dataset Splits | No | The paper mentions a training set of 1,030 videos and that the benchmark consists of 225 social interactions and 900 questions. It also states that 90 questions were randomly sampled for the human experiment. However, it does not provide specific train/validation/test splits (percentages, counts, or explicit methodology) for the 900 benchmark questions or for the model evaluations described. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models (e.g., NVIDIA A100, RTX 2080 Ti), CPU models, or detailed cloud computing specifications used for running the experiments or training the models. |
| Software Dependencies | Yes | We evaluated GPT-4o (OpenAI 2023), LLaVA 1.6 (Liu et al. 2023), Gemini 1.5 (Reid et al. 2024), InternVL2 (Chen et al. 2023) and VideoLLaMA 2 (Cheng et al. 2024). We evaluated the latest version of each LMM at the time of submission. For LIMP, we use Gemini 1.5 Pro as the VLM and GPT-4o as the LLM. |
| Experiment Setup | No | The paper does not provide specific experimental setup details such as hyperparameter values (e.g., learning rate, batch size, number of epochs) or other training configurations for the LIMP model or the evaluated baselines. While finetuning is mentioned for VideoLLaMA 2, no specifics of the finetuning process are provided. |