Model Swarms: Collaborative Search to Adapt LLM Experts via Swarm Intelligence

Authors: Shangbin Feng, Zifeng Wang, Yike Wang, Sayna Ebrahimi, Hamid Palangi, Lesly Miculicich, Achin Kulshrestha, Nathalie Rauschmayr, Yejin Choi, Yulia Tsvetkov, Chen-Yu Lee, Tomas Pfister

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that MODEL SWARMS could flexibly adapt LLM experts to a single task, multi-task domains, reward models, as well as diverse human interests, improving over 12 model composition baselines by up to 21.0% across tasks and contexts. Further analysis reveals that LLM experts discover previously unseen capabilities in initial checkpoints and that MODEL SWARMS enable the weak-to-strong transition of experts through the collaborative search process.
Researcher Affiliation | Collaboration | 1University of Washington, 2Google Cloud AI Research, 3Google DeepMind, 4Google, 5Stanford University. Correspondence to: Shangbin Feng <EMAIL>, Zifeng Wang <EMAIL>, Chen-Yu Lee <EMAIL>.
Pseudocode | Yes | We present an overview of MODEL SWARMS in Figure 1 and Algorithm 1. ... Algorithm 1: Model Swarms
Open Source Code | Yes | Code and data available at https://github.com/BunsenFeng/model_swarm.
Open Datasets | Yes | We investigate whether MODEL SWARMS could adapt LLM experts via collaborative search on four types of adaptation objectives and the corresponding utility functions. Single task: we employ 9 datasets spanning knowledge (MMLU (Hendrycks et al., 2021), MMLU-Pro (Wang et al., 2024e), HellaSwag (Zellers et al., 2019)), reasoning (GSM8k (Cobbe et al., 2021), Knowledge Crosswords (Ding et al., 2024), NLGraph (Wang et al., 2024a; Zhang et al., 2024a)), and safety (TruthfulQA (Lin et al., 2022), RealToxicityPrompts (Gehman et al., 2020), AbstainQA (Feng et al., 2024a)).
Dataset Splits | Yes | We by default randomly sample 200 and 1000 samples as the validation/test sets: the utility function f is defined as performance on the validation set. ... We randomly sample subsets from each dataset and present the statistics in Table 10.
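The default 200/1000 validation/test protocol quoted above can be sketched as follows; `make_splits` is a hypothetical helper (the paper does not publish this exact code), assuming a dataset given as a list of examples and disjoint sampling of the two subsets:

```python
import random

def make_splits(dataset, val_size=200, test_size=1000, seed=42):
    """Randomly sample disjoint validation/test subsets.

    Mirrors the reported default of 200 validation and 1000 test samples;
    the utility function f would then be measured on the validation split.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible draw
    sampled = rng.sample(dataset, val_size + test_size)  # without replacement
    return sampled[:val_size], sampled[val_size:]
```

The fixed seed is an assumption for reproducibility; the paper only states that subsets are sampled randomly.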
Hardware Specification | Yes | We empirically analyze the time complexity of employing 1 to 10 GPUs on our cluster of 16 A100 GPUs with 96 CPU cores with 10 default initial experts.
Software Dependencies | No | The paper mentions `GEMMA-7B`, `MISTRAL-7B`, `LoRA`, and `ROBERTA-BASE` as models or techniques, and `Python` implicitly, but the main text provides no version numbers for software libraries or environments to ensure reproducibility.
Experiment Setup | Yes | We fine-tune for 5 epochs with a starting learning rate of 2e-4 and effective batch size of 32 by default. For MODEL SWARMS searches, we employ N = 20, ϕλ = 0.95, p = 10, pr = 5, K = 50, while running grid search over other hyperparameters and report the best-found expert based on utility function f. Specifically, we search for ϕv ∈ {0.1, 0.2, 0.3}, ϕp ∈ {0.1, 0.2, 0.3, 0.4, 0.5}, ϕg ∈ {0.2, 0.3, 0.4, 0.5, 0.6}, ϕw ∈ {0.01, 0.05, 0.1}, λ ∈ {0.5, 0.6, 0.7, 0.8, 0.9, 1.0}.
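To make the hyperparameters concrete, here is a minimal sketch of one particle-swarm-style update over an expert's flattened weight vector, using the paper's coefficient names (ϕv inertia, ϕp personal-best pull, ϕg global-best pull, ϕw repulsion from the global worst, λ step scale). This is an assumed reading of Algorithm 1 based on standard PSO and the named hyperparameters, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed; the search itself is stochastic

def swarm_step(x, v, p_best, g_best, g_worst,
               phi_v=0.2, phi_p=0.3, phi_g=0.4, phi_w=0.05, lam=0.8):
    """One hypothetical velocity/position update for a single expert.

    x: current weight vector, v: current velocity,
    p_best/g_best/g_worst: personal-best, global-best, global-worst experts.
    """
    r_p, r_g, r_w = rng.random(3)  # random scaling factors, drawn per step
    v_new = lam * (phi_v * v                        # inertia term
                   + phi_p * r_p * (p_best - x)     # pull toward personal best
                   + phi_g * r_g * (g_best - x)     # pull toward global best
                   - phi_w * r_w * (g_worst - x))   # push away from global worst
    return x + v_new, v_new
```

In the paper's setting x would be the (LoRA) parameters of one of the N = 20 experts, updated for up to K = 50 iterations, with utility f on the validation set selecting p_best, g_best, and g_worst.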