Breaking Mental Set to Improve Reasoning through Diverse Multi-Agent Debate

Authors: Yexiang Liu, Jie Cao, Zekun Li, Ran He, Tieniu Tan

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate DMAD against various prompting techniques, including self-reflection and traditional MAD, across multiple benchmarks using both LLMs and Multimodal LLMs. Our experiments show that DMAD consistently outperforms other methods, delivering better results than MAD in fewer rounds.
Researcher Affiliation | Academia | 1) MAIS & NLPR, Institute of Automation, Chinese Academy of Sciences; 2) School of Artificial Intelligence, University of Chinese Academy of Sciences; 3) University of California, Santa Barbara; 4) Nanjing University. EMAIL, EMAIL
Pseudocode | Yes | Algorithm 1 provides a comprehensive summary of the procedures involved in DMAD.

Algorithm 1: DMAD algorithm
Require: input query x, n model instances {M_i | i = 1, 2, ..., n}, n reasoning methods {R_i | i = 1, 2, ..., n}, n debate histories {h_i | i = 1, 2, ..., n}, debate rounds N, judge φ
 1: for round j = 1, ..., N do
 2:   for agent i = 1, ..., n do
 3:     s_{i,j} = M_i(x | h_i; R_i)                ▷ Solving processes (Equation 1)
 4:     y_{i,j} = M_i(x, s_{i,j} | h_i; R_i)       ▷ Candidate answers (Equation 2)
 5:   end for
 6:   A_{i,j} = (x, s_{i,j}, y_{i,j}), H = {A_{i,j} | i = 1, 2, ..., n}   ▷ Collecting messages (Equation 3)
 7:   for agent i = 1, ..., n do
 8:     h_i ← [{A_{i,j}}, H \ {A_{i,j}}]           ▷ Updating histories (Equation 4)
 9:   end for
10:   y_j = φ({y_{i,j} | i = 1, 2, ..., n})        ▷ Obtaining debate solutions (Equation 5)
11: end for
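Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: `stub_agent` and `majority_vote` are hypothetical stand-ins for the paper's LLM agents and the Self-Consistency judge φ, and the `want`/`process` parameters are invented here to separate the two model calls (Equations 1 and 2).

```python
from collections import Counter

def dmad_debate(x, agents, reasoning_methods, n_rounds, judge):
    """Run the debate loop of Algorithm 1: each agent solves with its own
    reasoning method, then every history is updated with all messages."""
    n = len(agents)
    histories = [[] for _ in range(n)]  # h_i: per-agent debate history
    solution = None
    for _ in range(n_rounds):
        messages = []
        for i in range(n):
            # Eq. 1: solving process s_{i,j}; Eq. 2: candidate answer y_{i,j}
            s = agents[i](x, histories[i], reasoning_methods[i], want="process")
            y = agents[i](x, histories[i], reasoning_methods[i],
                          want="answer", process=s)
            messages.append((x, s, y))  # message A_{i,j} (Eq. 3)
        for i in range(n):
            # Eq. 4: agent i's own message first, then the others' messages
            histories[i].append(
                [messages[i]] + [m for k, m in enumerate(messages) if k != i])
        # Eq. 5: judge phi aggregates the round's candidate answers
        solution = judge([m[2] for m in messages])
    return solution

def stub_agent(x, history, method, want, process=None):
    """Hypothetical deterministic agent, keyed on its reasoning method."""
    if want == "process":
        return f"{method} reasoning about {x}"
    return "42" if method != "CoT-bad" else "41"  # toy candidate answers

def majority_vote(answers):
    """Self-Consistency-style judge: most frequent candidate answer."""
    return Counter(answers).most_common(1)[0][0]

result = dmad_debate("2*21?", [stub_agent] * 3,
                     ["CoT", "PoT", "CoT-bad"], n_rounds=2,
                     judge=majority_vote)
```

With the stub agents above, two of the three candidate answers agree each round, so the judge returns "42".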
Open Source Code | Yes | Code is available at https://github.com/MraDonkey/DMAD.
Open Datasets | Yes | Experiments are conducted on Large Language Models (LLMs) using text-only benchmarks, MATH (Hendrycks et al., 2021b) and GPQA (Rein et al., 2024), as well as on Multimodal Large Language Models (MLLMs) using multimodal benchmarks, ScienceQA (Lu et al., 2022) and MM-Vet (Yu et al., 2024b).
Dataset Splits | Yes | MATH (Hendrycks et al., 2021b) is a hard mathematics benchmark... We randomly select 100 test samples in each subject with random seed 0. [...] GPQA (Rein et al., 2024) is a challenging graduate-level Q&A benchmark... We test all methods and models on the whole dataset. [...] ScienceQA (Lu et al., 2022)... We use their QCM input format (Question, Context, Options) and test on all data containing images in the test split of ScienceQA, which comprises 2017 image-question pairs. GPT-4o is tested on 100 questions sampled using random seed 0. [...] MM-Vet (Yu et al., 2024b)... We test all MLLMs and methods on the whole dataset.
Hardware Specification | No | No specific hardware details (GPU models, CPU models, etc.) for running the experiments are provided in the paper. The paper only lists the LLM and MLLM models used for evaluation.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python version, PyTorch/TensorFlow version, CUDA version) are mentioned for the experimental setup. The paper mentions "Python program" in the context of PoT prompting, but without version details.
Experiment Setup | Yes | We use their default settings and hyper-parameters. We select n = 3 distinct reasoning methods as R = {R1, R2, R3}. ... To make a fair comparison, we set n = 3 agents and N = 2 rounds for all MAD settings. We set φ as Self-Consistency to get a final solution in each debate round.
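The judge φ, instantiated as Self-Consistency, reduces to a majority vote over the agents' candidate answers in each round. A minimal sketch, assuming exact string-matched answers (ties fall to the first answer seen, an assumption of this sketch rather than a detail stated in the paper):

```python
from collections import Counter

def self_consistency(answers):
    """Pick the most frequent candidate answer across agents.

    Counter.most_common preserves first-seen order among equal counts,
    so ties resolve to the earliest answer.
    """
    return Counter(answers).most_common(1)[0][0]

best = self_consistency(["A", "B", "A"])  # majority answer is "A"
```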