Mixture-of-Agents Enhances Large Language Model Capabilities

Authors: Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, James Y Zou

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct comprehensive evaluations using AlpacaEval 2.0, Arena-Hard (Li et al., 2024), MT-Bench (Zheng et al., 2023), and FLASK (Ye et al., 2023) benchmarks for assessing the response quality across various dimensions. The results demonstrate substantial improvements with our proposed method, achieving a SOTA win rate of 65.8% on AlpacaEval 2.0, outperforming GPT-4 Omni.
Researcher Affiliation | Collaboration | Duke University; Together AI; University of Chicago; Stanford University
Pseudocode | No | The paper describes its methodology in prose, illustrates the structure in Figure 2, and provides a prompt template in Table 1, but it does not include a distinct section or block explicitly labeled 'Pseudocode' or 'Algorithm'.
Open Source Code | Yes | https://github.com/togethercomputer/moa
Open Datasets | Yes | We conduct comprehensive evaluations using AlpacaEval 2.0 (Dubois et al., 2024), Arena-Hard (Li et al., 2024), MT-Bench (Zheng et al., 2023), and FLASK (Ye et al., 2023) benchmarks for assessing the response quality across various dimensions. Additionally, we use the MATH dataset (Hendrycks et al., 2021b), Big-Bench Hard (BBH) (Suzgun et al., 2023), MMLU (Hendrycks et al., 2021a), and CSQA (Talmor et al., 2021).
Dataset Splits | No | The paper mentions that AlpacaEval 2.0 contains 805 instructions and Arena-Hard contains 500 challenging user queries, but it does not explicitly specify training, validation, or test splits for any of the datasets used in its experiments. Benchmarks like AlpacaEval are intended primarily for evaluation, and the paper does not detail any custom splitting methodology for reproducibility beyond referring to these benchmarks.
Hardware Specification | No | The paper states: "For open-source models, all inferences were ran through Together Inference Endpoint." It also discusses "tflops" as a proxy for latency but does not specify any particular GPU models, CPU types, or detailed computing infrastructure (e.g., "NVIDIA A100", "Intel Xeon"). The reference to GPT-4's "rumored size from the community of an 8x220B architecture" refers to the model itself, not the hardware used by the authors for their experiments.
Software Dependencies | No | The paper states: "Our method does not require any fine-tuning and only utilizes the interface of prompting and generation of LLMs." It leverages various large language models (e.g., GPT-4o, Qwen1.5, LLaMA-3) but does not provide specific version numbers for any underlying software libraries, frameworks, or programming languages used for the implementation of the Mixture-of-Agents framework.
Experiment Setup | Yes | We construct 3 MoA layers and use the same set of models in each MoA layer. We use Qwen1.5-110B-Chat as the aggregator in the last layer. We also developed a variant called MoA w/ GPT-4o, which prioritizes high-quality outputs by using GPT-4o as the aggregator in the final MoA layer. Another variant, MoA-Lite, emphasizes cost-effectiveness: it uses the same set of models as proposers but includes only 2 MoA layers and employs Qwen1.5-72B-Chat as the aggregator. We compare two settings: single-proposer, where the n responses are generated by the same LLM with a temperature of 0.7, and multiple-proposer, where each response is generated by a different LLM.
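The setup described in the row above (layered proposers feeding a final aggregator) can be sketched as follows. This is a minimal illustration, not the authors' released implementation: the `generate` function is a hypothetical stand-in for a real LLM API call, and the aggregation instruction only paraphrases the idea of the paper's prompt template (Table 1), not its exact wording.

```python
# Sketch of the layered Mixture-of-Agents (MoA) flow: several proposer
# LLMs answer in each layer, each later layer sees the previous layer's
# answers alongside the original prompt, and a single aggregator model
# produces the final response.

# Illustrative instruction; the paper's actual template is in its Table 1.
AGGREGATE_INSTRUCTION = (
    "You have been provided with a set of responses from various models. "
    "Synthesize them into a single, high-quality response.\n\n"
)

def generate(model: str, prompt: str) -> str:
    """Placeholder LLM call; replace with a real inference request."""
    return f"[{model}] answer to: {prompt[:40]}"

def moa(prompt: str, proposers: list[str], aggregator: str, layers: int = 3) -> str:
    """Run a `layers`-deep MoA: proposer layers, then one final aggregation."""
    responses: list[str] = []
    for _ in range(layers - 1):
        if responses:
            # Later layers synthesize-and-extend the previous layer's answers.
            context = AGGREGATE_INSTRUCTION + "\n".join(responses) + "\n\n" + prompt
        else:
            # The first layer sees only the user prompt.
            context = prompt
        responses = [generate(m, context) for m in proposers]
    # Final layer: the aggregator alone synthesizes the last set of responses.
    return generate(aggregator, AGGREGATE_INSTRUCTION + "\n".join(responses) + "\n\n" + prompt)
```

With `layers=3` this mirrors the default setup above; MoA-Lite would correspond to `layers=2` with Qwen1.5-72B-Chat passed as `aggregator`.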