Balancing Act: Diversity and Consistency in Large Language Model Ensembles
Authors: Ahmed Abdulaal, Chen Jin, Nina Montaña-Brown, Aryo Pradipta Gema, Daniel Castro, Daniel Alexander, Philip Teare, Tom Diethe, Dino Oglic, Amrutha Saseendran
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | More specifically, we introduce a consistency score that defines a gating mechanism for mixtures of agents and an algorithm for mixture refinement to investigate these trade-offs at the semantic and model levels, respectively. We incorporate our insights into a novel inference-time LLM ensembling strategy called the Dynamic Mixture of Agents (DMoA) and demonstrate that it achieves a new state-of-the-art result in the challenging BIG-Bench Hard mixed evaluations benchmark. Our analysis reveals that cross-validation bias can enhance performance, contingent on the expertise of the constituent models. We further demonstrate that distinct reasoning tasks such as arithmetic reasoning, commonsense reasoning, and instruction following require different model capabilities, leading to inherent task-dependent trade-offs that DMoA can balance effectively. |
| Researcher Affiliation | Collaboration | 1Centre for AI, Data Science & Artificial Intelligence, R&D, AstraZeneca, UK 2Centre for Medical Image Computing, UCL, London, UK 3Microsoft Research, Cambridge, UK |
| Pseudocode | Yes | Algorithm 1: Mixture optimization experiment algorithm. |
| Open Source Code | Yes | We provide details on the experimental setup, hyperparameters, and data preprocessing steps to maximize the reproducibility of our results. The codebase can be found at: https://anonymous.4open.science/r/balancing_act-CB59/. |
| Open Datasets | Yes | Arithmetic reasoning: We used GSM8K (Cobbe et al., 2021), a set of linguistically diverse grade school math word problems, and MATH (Hendrycks et al., 2021), a set of challenging competition mathematics problems which require step-by-step reasoning to solve. Commonsense reasoning: For these tasks, we used CommonsenseQA (CSQA) (Talmor et al., 2018), a popular commonsense question answering dataset constructed by extracting target concepts from text that have the same semantic relation to a single source concept; questions then aim to discriminate between the target concepts. Correctly solving the questions often requires prior knowledge. We also used the AI2 Reasoning Challenge (Clark et al., 2018), a dataset of natural grade-school science questions authored for human tests. Instruction following: We considered AlpacaEval 2.0 (Dubois et al., 2024), which uses a reference-free method (GPT-4) to evaluate the quality of outputs by how aligned they are with human preferences. The benchmark calculates a length-controlled win-rate which explicitly accounts for the confound whereby the judge prefers longer answers. This benchmark aligns strongly with human preference and has a Spearman correlation of 0.98 with LMSYS Chatbot Arena. We also considered MT-Bench (Zheng et al., 2024), a multi-turn question set which uses LLM judges to evaluate answers, and which demonstrates strong agreement with human preferences. |
| Dataset Splits | Yes | We perform all arithmetic and commonsense reasoning tasks in the few-shot setting. For arithmetic reasoning, we generate a 5-shot CoT prompt by randomly selecting samples from the training set of each benchmark for each question. For the CSQA task, we used the same prompt as in (Wang et al., 2022; Wei et al., 2022). For the AI2 Reasoning Challenge tasks, we manually construct a 4-shot CoT prompt which is shown in Listing 1. |
| Hardware Specification | No | The paper discusses various LLMs and their performance, and includes a cost analysis for different models/APIs (e.g., 'gpt-4o-2024-05-13', 'Claude 3.5 Sonnet'), but does not explicitly state the specific hardware (GPU/CPU models, memory, etc.) used to run the experiments. References to pricing information from API providers suggest reliance on external services rather than specific local hardware. |
| Software Dependencies | Yes | We used OpenAI's text-embedding-3-small model (OpenAI, 2024a) as the sentence-embedding function e(·) (described in more detail in Sec. 4.1). Namely, we use Llama-3-70B-Instruct (Touvron et al., 2023), Qwen1.5-72B-Chat (Bai et al., 2023), Qwen1.5-110B-Chat (Bai et al., 2023), Mixtral-8x22B-v0.1 (Jiang et al., 2024), WizardLM-8x22B (Xu et al., 2023), and dbrx-instruct (The Mosaic Research Team, 2024). Namely, we use openai/gpt-3.5-turbo (OpenAI, 2024b) to evaluate the model responses. |
| Experiment Setup | Yes | We perform all arithmetic and commonsense reasoning tasks in the few-shot setting. For arithmetic reasoning, we generate a 5-shot CoT prompt by randomly selecting samples from the training set of each benchmark for each question. For the CSQA task, we used the same prompt as in (Wang et al., 2022; Wei et al., 2022). For the AI2 Reasoning Challenge tasks, we manually construct a 4-shot CoT prompt which is shown in Listing 1. In view of this, to generate LLM outputs we used temperature sampling (T) (Ackley et al., 1985; Ficler & Goldberg, 2017), with T = 0.7 for all LLMs across all experiments. We do not consider top-k truncation (Radford et al., 2019; Fan et al., 2018; Holtzman et al., 2018) or nucleus sampling (Holtzman et al., 2019). |
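
The consistency-score gating described in the abstract can be illustrated with a minimal sketch. The paper embeds answers with OpenAI's text-embedding-3-small; here a deterministic stub embedder stands in so the example runs offline, and the mean-pairwise-cosine definition, `gate` rule, and threshold of 0.9 are illustrative assumptions rather than the paper's exact formulation.

```python
import math

def embed(text):
    # Stub sentence embedder (hypothetical stand-in for
    # text-embedding-3-small): hashes tokens into a small unit vector
    # so the sketch runs offline and deterministically.
    dims = 8
    vec = [0.0] * dims
    for tok in text.lower().split():
        h = sum(ord(c) for c in tok)
        for i in range(dims):
            vec[i] += math.sin(h * (i + 1))
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def consistency(answers, embed_fn=embed):
    # Illustrative consistency score: mean pairwise cosine similarity
    # between the agents' embedded answers. High values indicate
    # semantic agreement; low values indicate diversity.
    if len(answers) < 2:
        return 1.0
    embs = [embed_fn(a) for a in answers]
    sims = [sum(x * y for x, y in zip(embs[i], embs[j]))
            for i in range(len(embs)) for j in range(i + 1, len(embs))]
    return sum(sims) / len(sims)

def gate(answers, threshold=0.9):
    # Toy gating rule: an ensemble that already agrees is accepted;
    # a disagreeing one is routed to another refinement round.
    return "accept" if consistency(answers) >= threshold else "refine"
```

Since the stub embedder returns unit vectors, identical answers score exactly 1.0 and pass the gate, while semantically divergent answers pull the mean cosine down and trigger refinement.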
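
The 5-shot CoT prompt construction used for the arithmetic benchmarks (random samples drawn from each benchmark's training split, per question) can be sketched as follows; the record fields `question`, `rationale`, and `answer` are hypothetical placeholders for the benchmark's actual format.

```python
import random

def build_fewshot_prompt(train_set, question, k=5, seed=None):
    # Sample k worked examples from the training split and format them
    # as chain-of-thought demonstrations ahead of the target question.
    rng = random.Random(seed)
    shots = rng.sample(train_set, k)
    parts = [
        f"Q: {ex['question']}\nA: {ex['rationale']} The answer is {ex['answer']}."
        for ex in shots
    ]
    parts.append(f"Q: {question}\nA:")  # model completes from here
    return "\n\n".join(parts)
```

Passing a fresh `seed` (or none) per question reproduces the paper's behavior of re-sampling the demonstrations for every item rather than fixing one prompt for the whole benchmark.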
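
Temperature sampling with T = 0.7 and no top-k or nucleus truncation, as specified in the setup above, amounts to rescaling the logits before a softmax draw over the full vocabulary; a minimal sketch:

```python
import math
import random

def temperature_probs(logits, temperature=0.7):
    # Softmax over temperature-scaled logits (numerically stabilized).
    # T < 1 sharpens the distribution; T > 1 flattens it.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def temperature_sample(logits, temperature=0.7, rng=None):
    # Draw one index from the full distribution -- no top-k or
    # nucleus truncation, matching the paper's setup.
    rng = rng or random.Random()
    probs = temperature_probs(logits, temperature)
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r <= cum:
            return i
    return len(probs) - 1
```

Because no truncation is applied, every token retains nonzero probability, which preserves the output diversity the ensembling strategy relies on.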