Mixture-of-Agents Enhances Large Language Model Capabilities

Authors: Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, James Y Zou

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct comprehensive evaluations using AlpacaEval 2.0, Arena-Hard (Li et al., 2024), MT-Bench (Zheng et al., 2023), and FLASK (Ye et al., 2023) benchmarks for assessing the response quality across various dimensions. The results demonstrate substantial improvements with our proposed method, achieving a SOTA win rate of 65.8% on AlpacaEval 2.0, outperforming GPT-4 Omni.
Researcher Affiliation | Collaboration | Duke University; Together AI; University of Chicago; Stanford University
Pseudocode | No | The paper describes its methodology in prose, illustrates the structure in Figure 2, and provides a prompt template in Table 1, but it does not include a distinct section or block explicitly labeled 'Pseudocode' or 'Algorithm'.
Open Source Code | Yes | https://github.com/togethercomputer/moa
Open Datasets | Yes | We conduct comprehensive evaluations using AlpacaEval 2.0 (Dubois et al., 2024), Arena-Hard (Li et al., 2024), MT-Bench (Zheng et al., 2023), and FLASK (Ye et al., 2023) benchmarks for assessing the response quality across various dimensions. Additionally, we use the MATH dataset (Hendrycks et al., 2021b), Big-Bench Hard (BBH) (Suzgun et al., 2023), MMLU (Hendrycks et al., 2021a), and CSQA (Talmor et al., 2021).
Dataset Splits | No | The paper mentions that AlpacaEval 2.0 contains 805 instructions and Arena-Hard contains 500 challenging user queries, but it does not explicitly specify training, validation, or test splits for any of the datasets used in its experiments. Benchmarks like AlpacaEval are intended primarily for evaluation, and the paper does not detail any custom splitting methodology for reproducibility beyond referring to these benchmarks.
Hardware Specification | No | The paper states: "For open-source models, all inferences were ran through Together Inference Endpoint." It also discusses "tflops" as a proxy for latency but does not specify any particular GPU models, CPU types, or detailed computing infrastructure (e.g., "NVIDIA A100", "Intel Xeon"). The reference to GPT-4's "rumored size from the community of an 8x220B architecture" refers to the model itself, not the hardware used by the authors for their experiments.
Software Dependencies | No | The paper states: "Our method does not require any fine-tuning and only utilizes the interface of prompting and generation of LLMs." It leverages various large language models (e.g., GPT-4o, Qwen1.5, LLaMA-3) but does not provide specific version numbers for any underlying software libraries, frameworks, or programming languages used for the implementation of the Mixture-of-Agents framework.
Experiment Setup | Yes | We construct 3 MoA layers and use the same set of models in each MoA layer. We use Qwen1.5-110B-Chat as the aggregator in the last layer. We also developed a variant called MoA w/ GPT-4o, which prioritizes high-quality outputs by using GPT-4o as the aggregator in the final MoA layer. Another variant, MoA-Lite, emphasizes cost-effectiveness: it uses the same set of models as proposers but includes only 2 MoA layers and employs Qwen1.5-72B-Chat as the aggregator. We compare two settings: single-proposer, where the n responses are generated by the same LLM with a temperature of 0.7, and multiple-proposer, where each response is generated by a different LLM.
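The setup described in the row above (layered proposers feeding a final aggregator) can be sketched as follows. This is a minimal illustration, not the authors' released implementation: the `generate` function is a hypothetical stand-in for a real LLM API call, and the aggregation instruction only paraphrases the idea of the paper's prompt template (Table 1), not its exact wording.

```python
# Sketch of the layered Mixture-of-Agents (MoA) flow: several proposer
# LLMs answer in each layer, each later layer sees the previous layer's
# answers alongside the original prompt, and a single aggregator model
# produces the final response.

# Illustrative instruction; the paper's actual template is in its Table 1.
AGGREGATE_INSTRUCTION = (
    "You have been provided with a set of responses from various models. "
    "Synthesize them into a single, high-quality response.\n\n"
)

def generate(model: str, prompt: str) -> str:
    """Placeholder LLM call; replace with a real inference request."""
    return f"[{model}] answer to: {prompt[:40]}"

def moa(prompt: str, proposers: list[str], aggregator: str, layers: int = 3) -> str:
    """Run a `layers`-deep MoA: proposer layers, then one final aggregation."""
    responses: list[str] = []
    for _ in range(layers - 1):
        if responses:
            # Later layers synthesize-and-extend the previous layer's answers.
            context = AGGREGATE_INSTRUCTION + "\n".join(responses) + "\n\n" + prompt
        else:
            # The first layer sees only the user prompt.
            context = prompt
        responses = [generate(m, context) for m in proposers]
    # Final layer: the aggregator alone synthesizes the last set of responses.
    return generate(aggregator, AGGREGATE_INSTRUCTION + "\n".join(responses) + "\n\n" + prompt)
```

With `layers=3` this mirrors the default setup above; MoA-Lite would correspond to `layers=2` with Qwen1.5-72B-Chat passed as `aggregator`.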