Model Swarms: Collaborative Search to Adapt LLM Experts via Swarm Intelligence
Authors: Shangbin Feng, Zifeng Wang, Yike Wang, Sayna Ebrahimi, Hamid Palangi, Lesly Miculicich, Achin Kulshrestha, Nathalie Rauschmayr, Yejin Choi, Yulia Tsvetkov, Chen-Yu Lee, Tomas Pfister
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that MODEL SWARMS could flexibly adapt LLM experts to a single task, multi-task domains, reward models, as well as diverse human interests, improving over 12 model composition baselines by up to 21.0% across tasks and contexts. Further analysis reveals that LLM experts discover previously unseen capabilities in initial checkpoints and that MODEL SWARMS enable the weak-to-strong transition of experts through the collaborative search process. |
| Researcher Affiliation | Collaboration | 1University of Washington 2Google Cloud AI Research 3Google DeepMind 4Google 5Stanford University. Correspondence to: Shangbin Feng <EMAIL>, Zifeng Wang <EMAIL>, Chen-Yu Lee <EMAIL>. |
| Pseudocode | Yes | We present an overview of MODEL SWARMS in Figure 1 and Algorithm 1. ... Algorithm 1: Model Swarms |
| Open Source Code | Yes | Code and data available at https://github.com/BunsenFeng/model_swarm. |
| Open Datasets | Yes | We investigate whether MODEL SWARMS could adapt LLM experts via collaborative search on four types of adaptation objectives and the corresponding utility functions. Single task: we employ 9 datasets spanning knowledge (MMLU (Hendrycks et al., 2021), MMLU-Pro (Wang et al., 2024e), Hellaswag (Zellers et al., 2019)), reasoning (GSM8k (Cobbe et al., 2021), Knowledge Crosswords (Ding et al., 2024), NLGraph (Wang et al., 2024a; Zhang et al., 2024a)), and safety (TruthfulQA (Lin et al., 2022), RealToxicityPrompts (Gehman et al., 2020), AbstainQA (Feng et al., 2024a)). |
| Dataset Splits | Yes | We by default randomly sample 200 and 1000 samples as the validation/test sets: the utility function f is defined as performance on the validation set. ... We randomly sample subsets from each dataset and present the statistics in Table 10. |
| Hardware Specification | Yes | We empirically analyze the time complexity of employing 1 to 10 GPUs on our cluster of 16 A100 GPUs with 96 CPU cores with 10 default initial experts. |
| Software Dependencies | No | The paper mentions `GEMMA-7B`, `MISTRAL-7B`, `LoRA`, and `ROBERTA-BASE` as models or techniques, and implies a Python implementation, but it provides no version numbers for software libraries or environments, which limits reproducibility. |
| Experiment Setup | Yes | We fine-tune for 5 epochs with a starting learning rate of 2e-4 and effective batch size of 32 by default. For MODEL SWARMS searches, we employ N = 20, ϕλ = 0.95, p = 10, pr = 5, K = 50, while running grid search over other hyperparameters and report the best-found expert based on utility function f. Specifically, we search for ϕv ∈ {0.1, 0.2, 0.3}, ϕp ∈ {0.1, 0.2, 0.3, 0.4, 0.5}, ϕg ∈ {0.2, 0.3, 0.4, 0.5, 0.6}, ϕw ∈ {0.01, 0.05, 0.1}, λ ∈ {0.5, 0.6, 0.7, 0.8, 0.9, 1.0}. |
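To make the hyperparameters in the setup row concrete, here is a minimal sketch of a particle-swarm-style collaborative search over expert weight vectors, using the table's hyperparameter names (N experts, K steps, inertia ϕv, personal-best weight ϕp, global-best weight ϕg, global-worst repel weight ϕw, step length λ). This is an illustrative PSO loop under assumed update rules, not a reproduction of the paper's Algorithm 1; the toy 2-D "experts" and the `utility` function are stand-ins for LoRA expert weights and the validation-set utility f.

```python
import numpy as np

def swarm_search(experts, f, K=50, phi_v=0.2, phi_p=0.3, phi_g=0.4,
                 phi_w=0.05, lam=0.9, seed=0):
    """PSO-style search: each expert is a particle; f scores a weight vector."""
    rng = np.random.default_rng(seed)
    x = np.array(experts, dtype=float)      # particle positions (expert weights)
    v = np.zeros_like(x)                    # velocities
    p = x.copy()                            # personal bests
    p_util = np.array([f(xi) for xi in x])
    g = p[np.argmax(p_util)].copy()         # global best so far
    g_w = x[np.argmin(p_util)].copy()       # global worst (repelled from)
    for _ in range(K):
        r_p, r_g, r_w = rng.random(3)
        # Assumed velocity rule: inertia + pull toward personal/global bests
        # + push away from the global worst.
        v = (phi_v * v
             + phi_p * r_p * (p - x)
             + phi_g * r_g * (g - x)
             + phi_w * r_w * (x - g_w))
        x = x + lam * v                     # step with length lambda
        util = np.array([f(xi) for xi in x])
        improved = util > p_util
        p[improved], p_util[improved] = x[improved], util[improved]
        g = p[np.argmax(p_util)].copy()
        g_w = x[np.argmin(util)].copy()
    return g, float(p_util.max())

# Toy usage: four 2-D "experts" searching toward a hypothetical optimum.
target = np.array([1.0, -2.0])
utility = lambda w: -float(np.linalg.norm(w - target))
init = [[0.0, 0.0], [2.0, 1.0], [-1.0, -1.0], [3.0, -3.0]]
best, best_util = swarm_search(init, utility)
```

In the paper's setting each "position" would be an expert's LoRA parameters and f would be validation-set performance; here the key point is only the structure of the loop: personal bests never degrade, so the returned utility is at least as good as the best initial expert.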