LiNeS: Post-training Layer Scaling Prevents Forgetting and Enhances Model Merging
Authors: Ke Wang, Nikos Dimitriadis, Alessandro Favero, Guillermo Ortiz-Jimenez, François Fleuret, Pascal Frossard
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | LiNeS demonstrates significant improvements in both single-task and multi-task settings across various benchmarks in vision and natural language processing. It mitigates forgetting, enhances out-of-distribution generalization, integrates seamlessly with existing multi-task model merging baselines, improving their performance across benchmarks and model sizes, and can boost generalization when merging LLM policies aligned with different rewards via RLHF. We empirically verify the effectiveness of applying LiNeS across diverse application domains. Section 5.1 presents results for improving robust fine-tuning (Wortsman et al., 2022b) for OOD generalization; Section 5.2 focuses on improving existing multi-task merging methods (Ilharco et al., 2023; Yadav et al., 2023; Wang et al., 2024) in both vision and NLP benchmarks. In Section 5.3, we apply LiNeS and improve the merging of single-task fine-tuned models within the setting of Model Soups (Wortsman et al., 2022a), and finally, we enhance merging foundation models fine-tuned on different rewards (Ramé et al., 2024a) in Section 5.4. |
| Researcher Affiliation | Collaboration | Ke Wang EPFL EMAIL Nikolaos Dimitriadis EPFL EMAIL Alessandro Favero EPFL EMAIL Guillermo Ortiz-Jimenez Google DeepMind EMAIL François Fleuret University of Geneva, Meta FAIR EMAIL Pascal Frossard EPFL EMAIL |
| Pseudocode | Yes | Our proposed method is simple to implement (PyTorch pseudo-code in Appendix A), orthogonal to many existing approaches, and improves performance in a wide variety of settings. Appendix A: LINES PSEUDOCODE |
| Open Source Code | Yes | Our source code is available at github.com/wang-kee/LiNeS. |
| Open Datasets | Yes | We evaluate CLIP models fine-tuned on ImageNet (Deng et al., 2009), considering 5 OOD datasets, namely ImageNet-Sketch (Wang et al., 2019), ImageNet-A (Hendrycks et al., 2021), ImageNet-R (Hendrycks et al., 2020), ObjectNet (Barbu et al., 2019), and ImageNet-V2 (Recht et al., 2019). The 8-task benchmark comprises the following tasks: Cars (Krause et al., 2013), DTD (Cimpoi et al., 2014), EuroSAT (Helber et al., 2019), GTSRB (Stallkamp et al., 2011), MNIST (LeCun, 1998), RESISC45 (Cheng et al., 2017), SUN397 (Xiao et al., 2016), and SVHN (Netzer et al., 2011). |
| Dataset Splits | Yes | We consider the 8-task image classification benchmark studied in Ilharco et al. (2023). The scalar coefficient λ is tuned using a held-out validation set. We search for the hyper-parameter on the validation set and report performance on the test set using the best hyper-parameter selected by validation performance. |
| Hardware Specification | No | The paper refers to models like CLIP ViT-B/32, ViT-L/14, ConvNeXt, and T5-large but does not specify the hardware (e.g., GPU models, CPU types, memory amounts) used for running experiments. |
| Software Dependencies | No | The paper mentions 'PyTorch pseudo-code in Appendix A.' but does not specify version numbers for PyTorch or any other software dependencies, libraries, or programming languages used. |
| Experiment Setup | Yes | We apply LiNeS to each of the 70 fine-tuned checkpoints, setting α = β = 0.5. For the linear scaling schedule, we tune only β and set α using a heuristic that adjusts based on both the number of merged models and the merging method. We fine-tune the model using LoRA (Hu et al., 2022) with r_LoRA = 64, α_LoRA = 128, and 0.05 dropout. We list the hyper-parameter search space for each model merging method in Table B.2.1. |
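The linear scaling schedule referenced in the Experiment Setup row (the α and β hyper-parameters) can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' released PyTorch code: it assumes the per-block coefficient λ_ℓ = α + β · ℓ / (L − 1) described in the paper, and uses lists of floats in place of parameter tensors grouped by network block.

```python
def lines_coefficients(num_blocks, alpha=0.5, beta=0.5):
    """Linear per-block scaling: block l of L gets alpha + beta * l / (L - 1).

    Shallow blocks (small l) receive the smallest coefficient, preserving the
    pretrained model's general features, while the deepest, most task-specific
    block is scaled by alpha + beta.
    """
    if num_blocks == 1:
        return [alpha + beta]
    return [alpha + beta * l / (num_blocks - 1) for l in range(num_blocks)]


def apply_lines(pretrained, task_vector, alpha=0.5, beta=0.5):
    """Edit pretrained weights with a depth-scaled task vector.

    Both arguments are lists of per-block parameter lists (plain floats here
    for illustration; a real implementation would operate on torch tensors,
    one group per transformer block).
    """
    coeffs = lines_coefficients(len(task_vector), alpha, beta)
    return [
        [p + c * d for p, d in zip(p_block, t_block)]
        for p_block, t_block, c in zip(pretrained, task_vector, coeffs)
    ]
```

With α = β = 0.5 (as in the checkpoint experiments above) and three blocks, the coefficients are [0.5, 0.75, 1.0], so the first block keeps only half of its task-vector update while the last block keeps all of it.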