What Matters for Model Merging at Scale?
Authors: Prateek Yadav, Tu Vu, Jonathan Lai, Alexandra Chronopoulou, Manaal Faruqui, Mohit Bansal, Tsendsuren Munkhdalai
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This work systematically evaluates the utility of model merging at scale for transformer-based models to examine the impact of these different factors. We experiment with merging fully fine-tuned models using four popular merging methods: Averaging, Task Arithmetic, DARE-TIES, and TIES-Merging, across model sizes ranging from 1B to 64B parameters and merging up to 8 different expert models. We evaluate the merged models on both held-in tasks, i.e., the experts' training tasks, and zero-shot generalization to unseen held-out tasks. Our wide range of experiments provides several new insights about merging transformer-based models at scale and the interplay between different factors. |
| Researcher Affiliation | Collaboration | Prateek Yadav EMAIL The University of North Carolina at Chapel Hill, Google DeepMind; Tu Vu EMAIL Virginia Tech, Google DeepMind; Jonathan Lai EMAIL Google DeepMind; Alexandra Chronopoulou EMAIL Google DeepMind; Manaal Faruqui EMAIL Google DeepMind; Mohit Bansal EMAIL The University of North Carolina at Chapel Hill; Tsendsuren Munkhdalai EMAIL Google DeepMind |
| Pseudocode | No | The paper describes the model merging methods (Averaging, Task Arithmetic, TIES-Merging, and DARE merging) using mathematical formulas and prose descriptions, but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing its own source code or a direct link to a code repository for the methodology described. |
| Open Datasets | Yes | Specifically, the 8 held-in task categories (with a total of 16 datasets) include Multiple-Choice QA (with selected datasets DREAM (Sun et al., 2019), Cosmos QA (Huang et al., 2019)), Extractive QA (Adversarial QA (Adelani et al., 2021), ROPES (Lin et al., 2019)), Closed-Book QA (Hotpot QA (Yang et al., 2018), Wiki QA (Yang et al., 2015)), Sentiment Analysis (App Reviews, IMDB (Maas et al., 2011)), Topic Classification (AG News (Zhang et al., 2015), DBPedia (Lehmann et al., 2015)), Structure-to-Text (CommonGen (Lin et al., 2020), WikiBio (Lebret et al., 2016)), Summarization (CNN Daily Mail (See et al., 2017), XSum (Narayan et al., 2018)), and Paraphrase Identification (MRPC (Dolan & Brockett, 2005), QQP (Iyer et al., 2017)). Similarly, the 4 held-out task categories are Sentence Completion (with selected datasets COPA (Roemmele et al., 2011), HellaSwag (Zellers et al., 2019)), Natural Language Inference (ANLI (Nie et al., 2019), RTE (Dagan et al., 2005)), Coreference Resolution (WSC (Levesque et al., 2012b), Winogrande (Levesque et al., 2012a)), and Word Sense Disambiguation (WiC (Pilehvar & Camacho-Collados, 2018)). |
| Dataset Splits | No | The paper mentions held-in and held-out tasks and specifies the datasets used for evaluation, but it does not provide explicit details about the training, validation, and test splits (e.g., percentages, sample counts, or specific split files) for these datasets as applied to their fine-tuning process. |
| Hardware Specification | No | The paper mentions 'Given the compute constraints' and refers to 'computational requirements' in Appendix B, but it does not specify any particular hardware components like GPU models, CPU types, or cloud computing instance details used for running the experiments. |
| Software Dependencies | No | The paper mentions using a 'Sharded Adafactor (Shazeer & Stern, 2018) optimizer' but does not specify version numbers for any software dependencies like programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | We train the PaLM-2 model for an additional 60000 steps on the Flan-v2 dataset (Longpre et al., 2023) to obtain the PaLM-2-IT model. We used the Sharded Adafactor (Shazeer & Stern, 2018) optimizer along with a cosine decay and a learning rate of 1e-4 for the 1B, 24B, and 64B model sizes, and 3e-5 for the 8B model. We use a dropout value of 0.05. Following Chung et al. (2024), we used an input length of 2048 and an output length of 512. To create expert models we perform full fine-tuning with the following hyperparameters. For training the expert models, for all model sizes, we train by default for 2000 steps with a learning rate of 3e-5 and dropout of 0.05. For some tasks we adjust the number of steps depending upon convergence. ...for the Task Arithmetic, TIES, and DARE methods, we tested values between 0 and 1, in steps of 0.1. For TIES and DARE, we pruned 80% and 90% of the values. |
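To make the table's method names and hyperparameter sweep concrete, here is a minimal sketch of the merging operations referenced above (Averaging, Task Arithmetic, and the trimming step shared by TIES/DARE-style methods), using NumPy arrays as stand-ins for flattened model parameters. The function names and toy vectors are illustrative assumptions, not the paper's actual implementation; the scaling coefficient `lam` corresponds to the value swept over [0, 1] in steps of 0.1, and `keep_frac` of 0.1–0.2 corresponds to pruning 90% or 80% of the values.

```python
import numpy as np

def task_vectors(base, experts):
    """Task vector of each expert: tau_i = theta_i - theta_base."""
    return [e - base for e in experts]

def average_merge(experts):
    """Averaging: element-wise mean of the expert parameters."""
    return np.mean(experts, axis=0)

def task_arithmetic(base, experts, lam=0.3):
    """Task Arithmetic: theta = theta_base + lam * sum_i tau_i."""
    return base + lam * np.sum(task_vectors(base, experts), axis=0)

def prune_task_vector(tau, keep_frac=0.2):
    """TIES/DARE-style trimming: zero out all but the top-magnitude
    entries (keep_frac=0.2 keeps 20%, i.e. prunes 80% of the values)."""
    k = max(1, int(keep_frac * tau.size))
    thresh = np.sort(np.abs(tau))[-k]
    return np.where(np.abs(tau) >= thresh, tau, 0.0)
```

In a hyperparameter sweep like the one described, `task_arithmetic` would be evaluated for `lam` in `np.arange(0.0, 1.1, 0.1)` and the best value selected on held-in performance; TIES and DARE additionally apply `prune_task_vector` to each task vector before combining them.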