Soup-of-Experts: Pretraining Specialist Models via Parameters Averaging

Authors: Pierre Ablin, Angelos Katharopoulos, Skyler Seto, David Grangier

ICML 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "We demonstrate how our approach obtains small specialized models on several language modeling tasks quickly." (Section 3, Experiments) |
| Researcher Affiliation | Industry | "1Apple. Correspondence to: Pierre Ablin <p EMAIL>." |
| Pseudocode | Yes | Algorithm 1: sampling from mix(h) = Σ_{i=1}^k h_i D_i. Algorithm 2: pre-training loop for a Soup-of-Experts minimizing the loss function L(S, E, ω) in Equation 3. Algorithm 3 (Grangier et al., 2024b): estimating specialist domain weights suited to a specialized dataset D_spe. |
| Open Source Code | No | The paper provides no explicit statement or link indicating that code for the described methodology is open-sourced. |
| Open Datasets | Yes | Pretraining domains: "We pre-train language models on Redpajama2 (Weber et al., 2024), a widely used curated web-crawl dataset." Specialization domains: 16 datasets from the PILE (Gao et al., 2020) serve as target specialization sets: arxiv, dm mathematics, enron emails, europarl, freelaw, github, hackernews, nih exporter, openwebtext, pg19, phil papers, pubmed, stackexchange, ubuntu, uspto, and wikipedia. |
| Dataset Splits | No | The paper mentions evaluating on "specialization domains" or a "held-out part of these datasets" but does not specify exact percentages, sample counts, or clear train/validation/test splits. |
| Hardware Specification | Yes | "Infrastructure: We train each model on 8 A100 GPUs." |
| Software Dependencies | No | The paper mentions methods such as Adam and Sentence-BERT but gives no version numbers for any software libraries or frameworks. |
| Experiment Setup | Yes | Table 3 lists training hyperparameters; Table 2 lists model architectures. |
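To make the two core ideas in the pseudocode concrete, here is a minimal NumPy sketch of (a) sampling a domain from the mixture mix(h) = Σ_{i=1}^k h_i D_i (Algorithm 1) and (b) instantiating one specialist model by averaging a bank of expert parameters with coefficients derived from the domain weights h. This is an illustration, not the authors' code: the linear `router` map, the variable names, and the random initializations are all assumptions made for the sketch; the paper learns how the combination coefficients depend on h during pre-training.

```python
import numpy as np

rng = np.random.default_rng(0)

n_experts, n_domains, n_params = 4, 16, 10

# Bank of expert parameter vectors (illustrative random initialization).
experts = rng.normal(size=(n_experts, n_params))
# Illustrative linear map from domain weights h to per-expert coefficients;
# in the paper this dependence is learned, not fixed.
router = rng.normal(size=(n_experts, n_domains))


def sample_domain(h, rng=rng):
    """Algorithm 1 sketch: draw a domain index i with probability h_i."""
    h = np.asarray(h, dtype=float)
    return rng.choice(len(h), p=h / h.sum())


def soup(h):
    """Instantiate a specialist: weighted average of expert parameters."""
    h = np.asarray(h, dtype=float)
    h = h / h.sum()            # domain weights form a distribution
    coeffs = router @ h        # per-expert coefficients from h
    return coeffs @ experts    # parameter average, shape (n_params,)


# Example: a specialist for a mixture concentrated on domain 0.
h = np.zeros(n_domains)
h[0] = 1.0
theta = soup(h)
print(theta.shape)  # (10,)
```

Because the soup is a single weighted average in parameter space, a new specialist costs one matrix-vector product plus one parameter average, with no retraining, which is what makes the approach fast at specialization time.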