Soup-of-Experts: Pretraining Specialist Models via Parameters Averaging
Authors: Pierre Ablin, Angelos Katharopoulos, Skyler Seto, David Grangier
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We demonstrate how our approach obtains small specialized models on several language modeling tasks quickly." (Section 3, Experiments) |
| Researcher Affiliation | Industry | Apple. "Correspondence to: Pierre Ablin <p EMAIL>." |
| Pseudocode | Yes | Algorithm 1: Sampling from mix(h) = Σ_{i=1}^{k} h_i D_i. Algorithm 2: Pre-training loop for a Soup-of-Experts to minimize the loss function L(S, E, ω) in Equation 3. Algorithm 3 (Grangier et al., 2024b): Estimating specialist domain weights that are good for a specialized dataset D_spe. |
| Open Source Code | No | The paper provides no explicit statement or link indicating that code for the described methodology is open-sourced. |
| Open Datasets | Yes | Pretraining domains: "We pre-train language models on Redpajama2 (Weber et al., 2024), a widely used curated web-crawl dataset." Specialization domains: "We consider 16 datasets from the PILE (Gao et al., 2020) as target specialization sets: arxiv, dm mathematics, enron emails, europarl, freelaw, github, hackernews, nih exporter, openwebtext, pg19, phil papers, pubmed, stackexchange, ubuntu, uspto, and wikipedia." |
| Dataset Splits | No | The paper mentions evaluating on "specialization domains" and a "held-out part of these datasets" but does not specify exact percentages, sample counts, or explicit train/validation/test splits. |
| Hardware Specification | Yes | Infrastructure We train each model on 8 A100 GPUs. |
| Software Dependencies | No | The paper mentions algorithms like Adam and Sentence-BERT but does not provide specific version numbers for any software libraries or frameworks used. |
| Experiment Setup | Yes | Table 2 (Model architectures) and Table 3 (Training hyperparameters). |
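The Pseudocode row cites Algorithm 1, which samples training examples from the domain mixture mix(h) = Σ_{i=1}^{k} h_i D_i. A minimal sketch of that sampling step, assuming each domain D_i is represented as a callable that yields one example (the function names and toy domains here are our own illustration, not from the paper):

```python
import random

def sample_from_mixture(domains, h, rng=random):
    """Draw one example from mix(h) = sum_i h_i * D_i:
    pick domain index i with probability h_i, then draw from D_i."""
    assert abs(sum(h) - 1.0) < 1e-8, "mixture weights must sum to 1"
    # random.choices selects an index according to the weights h
    (i,) = rng.choices(range(len(domains)), weights=h, k=1)
    return domains[i]()  # each D_i is a callable returning one example

# Toy stand-ins for pretraining data subsets (hypothetical)
domains = [lambda: "example from D1", lambda: "example from D2"]
print(sample_from_mixture(domains, [0.3, 0.7]))
```

In the paper's pre-training loop (Algorithm 2), the weight vector h itself is resampled per batch, so a single Soup-of-Experts sees many mixtures during training.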