COMMA: A Communicative Multimodal Multi-Agent Benchmark

Authors: Timothy Ossowski, Danyal Maqbool, Jixuan Chen, Zefan Cai, Tyler J. Bradshaw, Junjie Hu

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type Experimental

In this section, we describe the experimental settings of our multi-agent interaction environment, where two distinct agents, namely the Solver agent and the Expert agent, engage in iterative dialogue sessions. The primary aim of this setup is to assess the collaborative problem-solving capabilities of different agents. During our experiments, we limit the number of conversation turns to 10 and the number of mistakes to 3, allowing for a unified and systematic assessment of interactions. The puzzle set used in evaluation consists of 100 fixed but different initializations of each of the 10 puzzles, resulting in 1000 total conversations. For most models, we use greedy decoding when available to maintain consistent agent output across different runs of the same puzzle. However, for reasoning models we set the temperature to 0.6 to avoid endless repetition. All inference is run on a single NVIDIA A100 GPU with 80GB VRAM. We parse the solver's chosen actions at each conversation turn using exact string matching and directly perform the action on the interface if the solver outputs a valid action. Exact prompts for both agents are in Appendix D.

4.2 Evaluation Metrics

We recorded several key performance metrics through multiple iterations of the experiments described below:

Success Rate (SR): The solver agent is assigned a value of 0 or 100 for each puzzle depending on its completion status. These values are averaged across all puzzles to obtain the success rate.

Partial Success Rate (PSR): Because our benchmark includes puzzles with multi-step reasoning, some puzzles admit a more fine-grained success evaluation. For these multi-step puzzles, we assign the solver a number between 0 and 100 indicating its progress towards the solution, and average across puzzles to obtain the partial success rate. For single-step puzzles, the partial success rate equals the success rate.

Efficiency Score: Effective communication is concise but meaningful.
Inspired by the classical metric BLEU (Papineni et al., 2002), which balances n-gram precisions against a length penalty, we propose a new metric that balances success rate and conciseness:

$$S_{\text{conciseness}} = \frac{1}{1 + \frac{\tau}{1000}}, \qquad S_{\text{efficiency}} = \frac{2 \cdot \text{PSR} \cdot S_{\text{conciseness}}}{\text{PSR} + S_{\text{conciseness}}} \tag{1}$$

where τ is the average token usage per puzzle. We use the harmonic mean of the conciseness and performance scores to ensure that both must be high to indicate good performance.

Average Mistakes (AM): After the solver chooses an action, the environment checks whether the action was a mistake. We tally the mistakes made during each puzzle and take a global average across puzzles to obtain the average mistakes.

Average Conversation Length (ACL): We count the number of conversation turns the solver took to arrive at the solution, defaulting to the maximum of 10 if the solver failed. This count is averaged across all puzzles to obtain the average conversation length.
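Equation (1) can be sketched in a few lines of code. This is an illustrative reimplementation, not the authors' released code; the function name and the assumption that PSR is passed on its reported 0–100 scale (with conciseness in (0, 1]) are ours.

```python
def efficiency_score(psr: float, avg_tokens: float) -> float:
    """Efficiency score per Eq. (1).

    psr: partial success rate for a model (assumed 0-100 scale, per the paper).
    avg_tokens: average token usage per puzzle (tau in Eq. 1).
    """
    # Conciseness decays as average token usage grows past ~1000 tokens.
    s_conciseness = 1.0 / (1.0 + avg_tokens / 1000.0)
    if psr + s_conciseness == 0.0:
        return 0.0
    # Harmonic mean: both PSR and conciseness must be high for a high score.
    return 2.0 * psr * s_conciseness / (psr + s_conciseness)
```

For example, a perfect solver with negligible token usage scores `efficiency_score(1.0, 0.0) == 1.0` on a normalized 0–1 PSR scale, while heavier token usage drags the score down through the conciseness term.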
Researcher Affiliation Academia

Timothy Ossowski, Department of Computer Sciences, University of Wisconsin-Madison
Danyal Maqbool, Department of Computer Sciences, University of Wisconsin-Madison
Jixuan Chen, Department of Computer Sciences, UC San Diego
Zefan Cai, Department of Computer Sciences, University of Wisconsin-Madison
Tyler Bradshaw, Department of Radiology, University of Wisconsin-Madison
Junjie Hu, Department of Computer Sciences and Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison
Pseudocode No The paper describes the design principles, categories of agent cognitive capability, and experimental setup in detail, but it does not include any explicit pseudocode blocks or algorithm listings.
Open Source Code Yes Our data and code are available at https://github.com/tossowski/COMMA
Open Datasets Yes Telehealth Puzzle (PR): The solver is in a health-crisis situation and is presented with a private image of their skin and their background information (sourced from PAD-UFES-20 (Pacheco et al., 2020)).
Dataset Splits Yes The puzzle set used in evaluation consists of 100 fixed but different initializations of each of the 10 puzzles, resulting in 1000 total conversations. For most models, we use greedy decoding when available to maintain consistent agent output across different runs of the same puzzle. However, for reasoning models we set the temperature to 0.6 to avoid endless repetition. All inference is run on a single NVIDIA A100 GPU with 80GB VRAM. We parse the solver's chosen actions at each conversation turn using exact string matching and directly perform the action on the interface if the solver outputs a valid action. Exact prompts for both agents are in Appendix D.
Hardware Specification Yes All inference is run on a single NVIDIA A100 GPU with 80GB VRAM.
Software Dependencies No The paper discusses various models evaluated (e.g., GPT-4o, Gemini, Qwen VL, LLaMA 3.2, etc.) but does not specify the version numbers of software libraries or frameworks (like Python, PyTorch, or CUDA) used to implement or run the experiments.
Experiment Setup Yes In this section, we describe the experimental settings of our multi-agent interaction environment, where two distinct agents, namely the Solver agent and the Expert agent, engage in iterative dialogue sessions. The primary aim of this setup is to assess the collaborative problem-solving capabilities of different agents. During our experiments, we limit the number of conversation turns to 10 and the number of mistakes to 3, allowing for a unified and systematic assessment of interactions. The puzzle set used in evaluation consists of 100 fixed but different initializations of each of the 10 puzzles, resulting in 1000 total conversations. For most models, we use greedy decoding when available to maintain consistent agent output across different runs of the same puzzle. However, for reasoning models we set the temperature to 0.6 to avoid endless repetition. All inference is run on a single NVIDIA A100 GPU with 80GB VRAM. We parse the solver's chosen actions at each conversation turn using exact string matching and directly perform the action on the interface if the solver outputs a valid action. Exact prompts for both agents are in Appendix D.