Breaking Mental Set to Improve Reasoning through Diverse Multi-Agent Debate

Authors: Yexiang Liu, Jie Cao, Zekun Li, Ran He, Tieniu Tan

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate DMAD against various prompting techniques, including self-reflection and traditional MAD, across multiple benchmarks using both LLMs and Multimodal LLMs. Our experiments show that DMAD consistently outperforms other methods, delivering better results than MAD in fewer rounds.
Researcher Affiliation | Academia | 1) MAIS & NLPR, Institute of Automation, Chinese Academy of Sciences; 2) School of Artificial Intelligence, University of Chinese Academy of Sciences; 3) University of California, Santa Barbara; 4) Nanjing University. EMAIL, EMAIL
Pseudocode | Yes | Algorithm 1 provides a comprehensive summary of the procedures involved in DMAD.

Algorithm 1: DMAD algorithm
Require: input query x, n model instances {M_i | i = 1, 2, ..., n}, n reasoning methods {R_i | i = 1, 2, ..., n}, n debate histories {h_i | i = 1, 2, ..., n}, debate rounds N, judge φ
 1: for round j = 1, ..., N do
 2:   for agent i = 1, ..., n do
 3:     s_{i,j} = M_i(x | h_i; R_i)                ▷ Solving processes (Equation 1)
 4:     y_{i,j} = M_i(x, s_{i,j} | h_i; R_i)       ▷ Candidate answers (Equation 2)
 5:   end for
 6:   A_{i,j} = (x, s_{i,j}, y_{i,j}), H = {A_{i,j} | i = 1, 2, ..., n}   ▷ Collecting messages (Equation 3)
 7:   for agent i = 1, ..., n do
 8:     h_i ← [{A_{i,j}}, H \ {A_{i,j}}]           ▷ Updating histories (Equation 4)
 9:   end for
10:   y_j = φ({y_{i,j} | i = 1, 2, ..., n})        ▷ Obtaining debate solutions (Equation 5)
11: end for
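Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: `stub_agent` and `majority_vote` are hypothetical stand-ins for the paper's LLM agents and the Self-Consistency judge φ, and the `want`/`process` parameters are invented here to separate the two model calls (Equations 1 and 2).

```python
from collections import Counter

def dmad_debate(x, agents, reasoning_methods, n_rounds, judge):
    """Run the debate loop of Algorithm 1: each agent solves with its own
    reasoning method, then every history is updated with all messages."""
    n = len(agents)
    histories = [[] for _ in range(n)]  # h_i: per-agent debate history
    solution = None
    for _ in range(n_rounds):
        messages = []
        for i in range(n):
            # Eq. 1: solving process s_{i,j}; Eq. 2: candidate answer y_{i,j}
            s = agents[i](x, histories[i], reasoning_methods[i], want="process")
            y = agents[i](x, histories[i], reasoning_methods[i],
                          want="answer", process=s)
            messages.append((x, s, y))  # message A_{i,j} (Eq. 3)
        for i in range(n):
            # Eq. 4: agent i's own message first, then the others' messages
            histories[i].append(
                [messages[i]] + [m for k, m in enumerate(messages) if k != i])
        # Eq. 5: judge phi aggregates the round's candidate answers
        solution = judge([m[2] for m in messages])
    return solution

def stub_agent(x, history, method, want, process=None):
    """Hypothetical deterministic agent, keyed on its reasoning method."""
    if want == "process":
        return f"{method} reasoning about {x}"
    return "42" if method != "CoT-bad" else "41"  # toy candidate answers

def majority_vote(answers):
    """Self-Consistency-style judge: most frequent candidate answer."""
    return Counter(answers).most_common(1)[0][0]

result = dmad_debate("2*21?", [stub_agent] * 3,
                     ["CoT", "PoT", "CoT-bad"], n_rounds=2,
                     judge=majority_vote)
```

With the stub agents above, two of the three candidate answers agree each round, so the judge returns "42".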
Open Source Code | Yes | Code is available at https://github.com/MraDonkey/DMAD.
Open Datasets | Yes | Experiments are conducted on Large Language Models (LLMs) using text-only benchmarks, MATH (Hendrycks et al., 2021b) and GPQA (Rein et al., 2024), as well as on Multimodal Large Language Models (MLLMs) using multimodal benchmarks, ScienceQA (Lu et al., 2022) and MM-Vet (Yu et al., 2024b).
Dataset Splits | Yes | MATH (Hendrycks et al., 2021b) is a hard mathematics benchmark... We randomly select 100 test samples in each subject with random seed 0. [...] GPQA (Rein et al., 2024) is a challenging graduate-level Q&A benchmark... We test all methods and models on the whole dataset. [...] ScienceQA (Lu et al., 2022)... We use their QCM input format (Question, Context, Options) and test on all data containing images in the test split of ScienceQA, which comprises 2017 image-question pairs. GPT-4o is tested on 100 questions sampled using random seed 0. [...] MM-Vet (Yu et al., 2024b)... We test all MLLMs and methods on the whole dataset.
Hardware Specification | No | No specific hardware details (GPU models, CPU models, etc.) for running the experiments are provided in the paper. The paper only lists the LLM and MLLM models used for evaluation.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python version, PyTorch/TensorFlow version, CUDA version) are mentioned for the experimental setup. The paper mentions "Python program" in the context of PoT prompting, but without version details.
Experiment Setup | Yes | We use their default settings and hyper-parameters. We select n = 3 distinct reasoning methods as R = {R1, R2, R3}. ... To make a fair comparison, we set n = 3 agents and N = 2 rounds for all MAD settings. We set φ as Self-Consistency to get a final solution in each debate round.
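The judge φ, instantiated as Self-Consistency, reduces to a majority vote over the agents' candidate answers in each round. A minimal sketch, assuming exact string-matched answers (ties fall to the first answer seen, an assumption of this sketch rather than a detail stated in the paper):

```python
from collections import Counter

def self_consistency(answers):
    """Pick the most frequent candidate answer across agents.

    Counter.most_common preserves first-seen order among equal counts,
    so ties resolve to the earliest answer.
    """
    return Counter(answers).most_common(1)[0][0]

best = self_consistency(["A", "B", "A"])  # majority answer is "A"
```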