Model Swarms: Collaborative Search to Adapt LLM Experts via Swarm Intelligence

Authors: Shangbin Feng, Zifeng Wang, Yike Wang, Sayna Ebrahimi, Hamid Palangi, Lesly Miculicich, Achin Kulshrestha, Nathalie Rauschmayr, Yejin Choi, Yulia Tsvetkov, Chen-Yu Lee, Tomas Pfister

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that MODEL SWARMS could flexibly adapt LLM experts to a single task, multi-task domains, reward models, as well as diverse human interests, improving over 12 model composition baselines by up to 21.0% across tasks and contexts. Further analysis reveals that LLM experts discover previously unseen capabilities in initial checkpoints and that MODEL SWARMS enable the weak-to-strong transition of experts through the collaborative search process.
Researcher Affiliation | Collaboration | 1University of Washington, 2Google Cloud AI Research, 3Google DeepMind, 4Google, 5Stanford University. Correspondence to: Shangbin Feng <EMAIL>, Zifeng Wang <EMAIL>, Chen-Yu Lee <EMAIL>.
Pseudocode | Yes | We present an overview of MODEL SWARMS in Figure 1 and Algorithm 1. ... Algorithm 1: Model Swarms
Open Source Code | Yes | Code and data available at https://github.com/BunsenFeng/model_swarm.
Open Datasets | Yes | We investigate whether MODEL SWARMS could adapt LLM experts via collaborative search on four types of adaptation objectives and the corresponding utility functions. Single task: we employ 9 datasets spanning knowledge (MMLU (Hendrycks et al., 2021), MMLU-Pro (Wang et al., 2024e), HellaSwag (Zellers et al., 2019)), reasoning (GSM8k (Cobbe et al., 2021), Knowledge Crosswords (Ding et al., 2024), NLGraph (Wang et al., 2024a; Zhang et al., 2024a)), and safety (TruthfulQA (Lin et al., 2022), RealToxicityPrompts (Gehman et al., 2020), AbstainQA (Feng et al., 2024a)).
Dataset Splits | Yes | We by default randomly sample 200 and 1000 samples as the validation/test sets: the utility function f is defined as performance on the validation set. ... We randomly sample subsets from each dataset and present the statistics in Table 10.
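The default 200/1000 validation/test protocol quoted above can be sketched as follows; `make_splits` is a hypothetical helper (the paper does not publish this exact code), assuming a dataset given as a list of examples and disjoint sampling of the two subsets:

```python
import random

def make_splits(dataset, val_size=200, test_size=1000, seed=42):
    """Randomly sample disjoint validation/test subsets.

    Mirrors the reported default of 200 validation and 1000 test samples;
    the utility function f would then be measured on the validation split.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible draw
    sampled = rng.sample(dataset, val_size + test_size)  # without replacement
    return sampled[:val_size], sampled[val_size:]
```

The fixed seed is an assumption for reproducibility; the paper only states that subsets are sampled randomly.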
Hardware Specification | Yes | We empirically analyze the time complexity of employing 1 to 10 GPUs on our cluster of 16 A100 GPUs with 96 CPU cores with 10 default initial experts.
Software Dependencies | No | The paper mentions `GEMMA-7B`, `MISTRAL-7B`, `LoRA`, and `ROBERTA-BASE` as models or techniques, and `Python` implicitly, but the main text provides no version numbers for software libraries or environments to ensure reproducibility.
Experiment Setup | Yes | We fine-tune for 5 epochs with a starting learning rate of 2e-4 and effective batch size of 32 by default. For MODEL SWARMS searches, we employ N = 20, ϕλ = 0.95, p = 10, pr = 5, K = 50, while running grid search over other hyperparameters and report the best-found expert based on utility function f. Specifically, we search for ϕv ∈ {0.1, 0.2, 0.3}, ϕp ∈ {0.1, 0.2, 0.3, 0.4, 0.5}, ϕg ∈ {0.2, 0.3, 0.4, 0.5, 0.6}, ϕw ∈ {0.01, 0.05, 0.1}, λ ∈ {0.5, 0.6, 0.7, 0.8, 0.9, 1.0}.
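To make the hyperparameters concrete, here is a minimal sketch of one particle-swarm-style update over an expert's flattened weight vector, using the paper's coefficient names (ϕv inertia, ϕp personal-best pull, ϕg global-best pull, ϕw repulsion from the global worst, λ step scale). This is an assumed reading of Algorithm 1 based on standard PSO and the named hyperparameters, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed; the search itself is stochastic

def swarm_step(x, v, p_best, g_best, g_worst,
               phi_v=0.2, phi_p=0.3, phi_g=0.4, phi_w=0.05, lam=0.8):
    """One hypothetical velocity/position update for a single expert.

    x: current weight vector, v: current velocity,
    p_best/g_best/g_worst: personal-best, global-best, global-worst experts.
    """
    r_p, r_g, r_w = rng.random(3)  # random scaling factors, drawn per step
    v_new = lam * (phi_v * v                        # inertia term
                   + phi_p * r_p * (p_best - x)     # pull toward personal best
                   + phi_g * r_g * (g_best - x)     # pull toward global best
                   - phi_w * r_w * (g_worst - x))   # push away from global worst
    return x + v_new, v_new
```

In the paper's setting x would be the (LoRA) parameters of one of the N = 20 experts, updated for up to K = 50 iterations, with utility f on the validation set selecting p_best, g_best, and g_worst.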