Scalable Model Merging with Progressive Layer-wise Distillation

Authors: Jing Xu, Jiazheng Li, Jingzhao Zhang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments to evaluate the performance of ProDistill across various tasks, architectures, and scales. Compared to both training-based and training-free baselines, ProDistill achieves a notable 6.14% increase in absolute performance for vision tasks and a 6.61% increase for natural language understanding tasks. Furthermore, we extend the experiments to models with over 10B parameters, showcasing the exceptional scalability of ProDistill.
Researcher Affiliation | Academia | 1 Institute for Interdisciplinary Information Sciences, Tsinghua University; 2 Shanghai Qizhi Institute; 3 School of Computer Science, Beijing Institute of Technology. Correspondence to: Jing Xu <EMAIL>, Jiazheng Li <EMAIL>, Jingzhao Zhang <EMAIL>.
Pseudocode | Yes | Algorithm 1: ProDistill (Progressive Layer-wise Distillation)
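Algorithm 1 itself is not reproduced in this report. The following is a minimal NumPy sketch of how progressive layer-wise distillation could work on a toy linear network, assuming, as the method's name and the few-shot validation data suggest, that layers are merged one at a time by fitting per-task merging coefficients so that the merged layer's hidden features match those of the fine-tuned models. The network sizes, the coefficient parameterization, and all variable names are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the real setting: a 3-layer tanh network (the base
# model) plus two fine-tuned task models. All dimensions are illustrative.
num_layers, dim, num_tasks = 3, 8, 2
base = [rng.normal(0.0, 0.3, size=(dim, dim)) for _ in range(num_layers)]
finetuned = [[w + 0.2 * rng.normal(size=w.shape) for w in base]
             for _ in range(num_tasks)]
task_vecs = [[ft[l] - base[l] for l in range(num_layers)] for ft in finetuned]

x = rng.normal(size=(64, dim))            # few-shot validation inputs
merged, history = [], []
h_merged = x                              # merged-model activations so far
h_tasks = [x] * num_tasks                 # each fine-tuned model's activations

for l in range(num_layers):
    # Teacher features: each fine-tuned model's layer-l hidden states.
    h_tasks = [np.tanh(h_tasks[i] @ finetuned[i][l]) for i in range(num_tasks)]

    def layer_loss(c):
        w = base[l] + sum(ci * tv[l] for ci, tv in zip(c, task_vecs))
        out = np.tanh(h_merged @ w)
        return np.mean([np.mean((out - t) ** 2) for t in h_tasks])

    # Fit per-task merging coefficients for this layer by gradient descent
    # on the feature-matching loss, keeping the best coefficients seen.
    coefs = np.full(num_tasks, 1.0 / num_tasks)
    uniform_loss = layer_loss(coefs)
    best_loss, best_coefs = uniform_loss, coefs.copy()
    for _ in range(300):
        w = base[l] + sum(c * tv[l] for c, tv in zip(coefs, task_vecs))
        out = np.tanh(h_merged @ w)
        g_out = sum(out - t for t in h_tasks) * (2.0 / (num_tasks * out.size))
        g_pre = g_out * (1.0 - out ** 2)  # backprop through tanh
        grads = np.array([np.sum(g_pre * (h_merged @ tv[l]))
                          for tv in task_vecs])
        coefs = coefs - 0.05 * grads
        loss = layer_loss(coefs)
        if loss < best_loss:
            best_loss, best_coefs = loss, coefs.copy()
    history.append((uniform_loss, best_loss))

    # Freeze this layer with the best coefficients and move to the next.
    w = base[l] + sum(c * tv[l] for c, tv in zip(best_coefs, task_vecs))
    merged.append(w)
    h_merged = np.tanh(h_merged @ w)
```

The key property this sketch shares with the quoted algorithm is that each layer is optimized and frozen in sequence, so only one layer's activations ever need gradients at a time, which is what keeps the memory footprint small as models scale.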
Open Source Code | Yes | Code is available at https://github.com/JingXuTHU/Scalable_Model_Merging_with_Progressive_Layerwise_Distillation.
Open Datasets | Yes | For vision tasks, we follow the initial practice of Ilharco et al. (2022) and build a vision benchmark consisting of eight datasets, including MNIST (LeCun et al., 2010), EuroSAT (Helber et al., 2019), GTSRB (Stallkamp et al., 2011), SVHN (Netzer et al., 2011), DTD (Cimpoi et al., 2014), RESISC45 (Cheng et al., 2017), Stanford Cars (Krause et al., 2013), SUN397 (Xiao et al., 2016). For natural language understanding (NLU) tasks, we follow the practice in Yu et al. (2024b) and use eight datasets from the GLUE benchmark (Wang, 2018), including CoLA (Warstadt et al., 2018), SST-2 (Socher et al., 2013), MRPC (Dolan & Brockett, 2005), STS-B (Cer et al., 2017), QQP (Iyer et al., 2017), MNLI (Williams et al., 2017), QNLI (Wang, 2018; Rajpurkar, 2016), RTE (Wang, 2018; Dagan et al., 2005; Bar-Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009).
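The eight GLUE tasks in the quote can be summarized as a small benchmark spec. The sketch below is illustrative: the task keys are assumed to match the lowercase config names used by the Hugging Face `datasets` GLUE loader (the quote does not name a loading library), while the input fields and label counts follow the standard GLUE task definitions.

```python
# Benchmark spec for the eight GLUE tasks listed in the quote.
# Keys assumed to match Hugging Face `datasets` GLUE config names;
# fields and label counts follow the standard GLUE task definitions.
GLUE_TASKS = {
    "cola": {"inputs": ("sentence",),              "num_labels": 2},
    "sst2": {"inputs": ("sentence",),              "num_labels": 2},
    "mrpc": {"inputs": ("sentence1", "sentence2"), "num_labels": 2},
    "stsb": {"inputs": ("sentence1", "sentence2"), "num_labels": 1},  # regression
    "qqp":  {"inputs": ("question1", "question2"), "num_labels": 2},
    "mnli": {"inputs": ("premise", "hypothesis"),  "num_labels": 3},
    "qnli": {"inputs": ("question", "sentence"),   "num_labels": 2},
    "rte":  {"inputs": ("sentence1", "sentence2"), "num_labels": 2},
}

# Under the tooling assumption above, each split would be fetched with
# e.g. datasets.load_dataset("glue", task) for task in GLUE_TASKS.
```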
Dataset Splits | Yes | All methods, except Task Arithmetic, require a few-shot validation dataset that contains domain-specific information of downstream tasks. This seems to contradict the data-free nature of task arithmetic. In light of this, we raise the question: is domain-specific data necessary for model merging? ... The few-shot validation set is randomly sampled from the training set, with the validation shot count set to 64 per task. ... For NLG tasks, the validation set is randomly sampled from the test sets of AlpacaEval 2.0, GSM8K and MBPP, and we exclude these test data points in evaluation.
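The quoted protocol, 64 random shots per task from the training split, and for NLG a sample drawn from the test split that is then excluded from evaluation, amounts to a few lines of sampling logic. A minimal sketch, with hypothetical integer example IDs standing in for real data points:

```python
import random

def sample_few_shot(examples, shots=64, seed=0):
    """Randomly sample a few-shot validation set (64 shots per task
    in the quoted setup)."""
    return random.Random(seed).sample(examples, min(shots, len(examples)))

# Vision/NLU case: sample from the training split.
train = list(range(1000))        # hypothetical example IDs
val = sample_few_shot(train)

# NLG case: sample from the test split, then exclude those points
# from evaluation, as the quote describes.
test = list(range(200))
nlg_val = sample_few_shot(test)
held_out = set(nlg_val)
eval_set = [e for e in test if e not in held_out]
```

Fixing the seed per task is an assumption added here for reproducibility; the quote only says the sampling is random.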
Hardware Specification | No | The paper discusses memory costs and GPU memory footprint but does not specify the hardware (GPU model, count, or compute environment) used for the experiments.
Software Dependencies | No | The paper mentions optimizers like Adam (Kingma, 2014) or AdamW (Loshchilov, 2017) and models like BERT-base-uncased and RoBERTa-base, but does not provide specific version numbers for software libraries or environments used for implementation.
Experiment Setup | Yes | For NLU tasks, we fine-tune BERT-base-uncased and RoBERTa-base models for 10 epochs. The weight decay is set to 0.01. We use a learning rate of 1e-5 with a warm-up strategy. ... For ViT models and LLMs, the learning rate is chosen from {0.1, 0.01}; for BERT/RoBERTa models, the learning rate is chosen from {0.01, 0.001}. The number of epochs is chosen from {50, 100, 200}. ... The batch size is set to 32 for vision tasks and 16 for NLU tasks.
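The quoted hyperparameters can be collected into a single configuration. In the sketch below, the linear shape of the warm-up and the 10% warm-up fraction are assumptions; the quote only says that a warm-up strategy is used with a 1e-5 learning rate.

```python
# Hyperparameters as quoted; dict layout and names are this report's own.
FINETUNE = {
    "epochs": 10,
    "lr": 1e-5,
    "weight_decay": 0.01,
    "batch_size": {"vision": 32, "nlu": 16},
}
DISTILL_GRID = {
    "vit_llm_lr": [0.1, 0.01],        # ViT models and LLMs
    "bert_roberta_lr": [0.01, 0.001], # BERT/RoBERTa models
    "epochs": [50, 100, 200],
}

def warmup_lr(step, total_steps, base_lr=1e-5, warmup_frac=0.1):
    """Linear warm-up to base_lr, then constant.
    The schedule shape and 10% fraction are assumed, not quoted."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```

With e.g. 1000 total steps, the rate ramps from 1e-7 at step 0 up to the full 1e-5 by step 100 and stays there.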