Breaking Mental Set to Improve Reasoning through Diverse Multi-Agent Debate
Authors: Yexiang Liu, Jie Cao, Zekun Li, Ran He, Tieniu Tan
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate DMAD against various prompting techniques, including self-reflection and traditional MAD, across multiple benchmarks using both LLMs and Multimodal LLMs. Our experiments show that DMAD consistently outperforms other methods, delivering better results than MAD in fewer rounds. |
| Researcher Affiliation | Academia | ¹MAIS & NLPR, Institute of Automation, Chinese Academy of Sciences; ²School of Artificial Intelligence, University of Chinese Academy of Sciences; ³University of California, Santa Barbara; ⁴Nanjing University |
| Pseudocode | Yes | Algorithm 1 provides a comprehensive summary of the procedures involved in DMAD. **Algorithm 1 (DMAD).** Require: input query x; n model instances {M_i \| i = 1, ..., n}; n reasoning methods {R_i \| i = 1, ..., n}; n debate histories {h_i \| i = 1, ..., n}; debate rounds N; judge ϕ. <br> 1: for round j = 1, ..., N do <br> 2: &nbsp;&nbsp; for agent i = 1, ..., n do <br> 3: &nbsp;&nbsp;&nbsp;&nbsp; s_{i,j} = M_i(x \| h_i; R_i) (solving processes, Eq. 1) <br> 4: &nbsp;&nbsp;&nbsp;&nbsp; y_{i,j} = M_i(x, s_{i,j} \| h_i; R_i) (candidate answers, Eq. 2) <br> 5: &nbsp;&nbsp; end for <br> 6: &nbsp;&nbsp; A_{i,j} = (x, s_{i,j}, y_{i,j}), H = {A_{i,j} \| i = 1, ..., n} (collecting messages, Eq. 3) <br> 7: &nbsp;&nbsp; for agent i = 1, ..., n do <br> 8: &nbsp;&nbsp;&nbsp;&nbsp; h_i ← [{A_{i,j}}, H \ {A_{i,j}}] (updating histories, Eq. 4) <br> 9: &nbsp;&nbsp; end for <br> 10: &nbsp;&nbsp; y_j = ϕ({y_{i,j} \| i = 1, ..., n}) (obtaining debate solutions, Eq. 5) <br> 11: end for |
| Open Source Code | Yes | Code is available at https://github.com/MraDonkey/DMAD. |
| Open Datasets | Yes | Experiments are conducted on Large Language Models (LLMs) using text-only benchmarks, MATH (Hendrycks et al., 2021b) and GPQA (Rein et al., 2024), as well as on Multimodal Large Language Models (MLLMs) using multimodal benchmarks, Science QA (Lu et al., 2022) and MM-Vet (Yu et al., 2024b). |
| Dataset Splits | Yes | MATH (Hendrycks et al., 2021b) is a hard mathematics benchmark... We randomly select 100 test samples in each subject with random seed 0. [...] GPQA (Rein et al., 2024) is a challenging graduate-level Q&A benchmark... We test all methods and models on the whole dataset. [...] Science QA (Lu et al., 2022)... We use their QCM input format (Question, Context, Options) and test on all data containing images in the test split of Science QA, which comprises 2017 image-question pairs. GPT-4o is tested on 100 questions sampled using random seed 0. [...] MM-Vet (Yu et al., 2024b)... We test all MLLMs and methods on the whole dataset. |
| Hardware Specification | No | No specific hardware details (GPU models, CPU models, etc.) for running the experiments are provided in the paper. The paper only lists the LLM and MLLM models used for evaluation. |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python version, PyTorch/TensorFlow version, CUDA version) are mentioned for the experimental setup. The paper mentions "Python program" in the context of PoT prompting but without version details. |
| Experiment Setup | Yes | We use their default settings and hyper-parameters. We select n = 3 distinct reasoning methods as R = {R1, R2, R3}. ... To make a fair comparison, we set n = 3 agents and N = 2 rounds for all MAD settings. We set ϕ as Self-Consistency to get a final solution in each debate round. |
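The debate loop summarized in Algorithm 1 above, with the Self-Consistency judge from the experiment setup, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`dmad_debate`, `self_consistency`), the `model(x, history, method, stage, ...)` callable signature, and the method labels are all assumptions standing in for real LLM calls.

```python
from collections import Counter

def dmad_debate(x, agents, methods, n_rounds=2, judge=None):
    """Sketch of the DMAD loop (Algorithm 1): each agent keeps its own
    reasoning method R_i and debate history h_i; the judge phi aggregates
    the candidate answers in every round."""
    histories = [[] for _ in agents]  # h_i, one per agent
    final = None
    for _ in range(n_rounds):  # round j = 1, ..., N
        messages = []
        for i, (model, method) in enumerate(zip(agents, methods)):
            # Eq. 1: solving process conditioned on the agent's own history
            s = model(x, histories[i], method, stage="solve")
            # Eq. 2: candidate answer derived from the solving process
            y = model(x, histories[i], method, stage="answer", solving=s)
            messages.append((x, s, y))  # message A_{i,j} (Eq. 3)
        for i in range(len(agents)):
            own = messages[i]
            others = [m for k, m in enumerate(messages) if k != i]
            # Eq. 4: each agent sees its own message first, then the others'
            histories[i].append([own, *others])
        answers = [y for (_, _, y) in messages]
        final = judge(answers)  # Eq. 5: debate solution for this round
    return final

def self_consistency(answers):
    """Majority vote over candidate answers (ties go to the first seen),
    used here as the judge phi."""
    return Counter(answers).most_common(1)[0][0]
```

With n = 3 toy agents whose answers deterministically split 2-to-1, the majority answer survives both rounds; swapping `judge` for another aggregator leaves the debate loop unchanged, which mirrors how ϕ is a pluggable component in the algorithm.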