LiNeS: Post-training Layer Scaling Prevents Forgetting and Enhances Model Merging
Authors: Ke Wang, Nikos Dimitriadis, Alessandro Favero, Guillermo Ortiz-Jimenez, François Fleuret, Pascal Frossard
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | LiNeS demonstrates significant improvements in both single-task and multi-task settings across various benchmarks in vision and natural language processing. It mitigates forgetting, enhances out-of-distribution generalization, integrates seamlessly with existing multi-task model merging baselines, improving their performance across benchmarks and model sizes, and can boost generalization when merging LLM policies aligned with different rewards via RLHF. We empirically verify the effectiveness of applying LiNeS across diverse application domains. Section 5.1 presents results for improving robust fine-tuning (Wortsman et al., 2022b) for OOD generalization; Section 5.2 focuses on improving existing multi-task merging methods (Ilharco et al., 2023; Yadav et al., 2023; Wang et al., 2024) in both vision and NLP benchmarks. In Section 5.3, we apply LiNeS and improve the merging of single-task fine-tuned models within the setting of Model Soups (Wortsman et al., 2022a), and finally, we enhance merging foundation models fine-tuned on different rewards (Ramé et al., 2024a) in Section 5.4. |
| Researcher Affiliation | Collaboration | Ke Wang EPFL EMAIL Nikolaos Dimitriadis EPFL EMAIL Alessandro Favero EPFL EMAIL Guillermo Ortiz-Jimenez Google DeepMind EMAIL François Fleuret University of Geneva, Meta FAIR EMAIL Pascal Frossard EPFL EMAIL |
| Pseudocode | Yes | Our proposed method is simple to implement (PyTorch pseudo-code in Appendix A), orthogonal to many existing approaches, and improves performance in a wide variety of settings. Appendix A: LINES PSEUDOCODE |
| Open Source Code | Yes | Our source code is available at github.com/wang-kee/LiNeS. |
| Open Datasets | Yes | We evaluate CLIP models fine-tuned on ImageNet (Deng et al., 2009), considering 5 OOD datasets, namely ImageNet-Sketch (Wang et al., 2019), ImageNet-A (Hendrycks et al., 2021), ImageNet-R (Hendrycks et al., 2020), ObjectNet (Barbu et al., 2019), and ImageNet-V2 (Recht et al., 2019). The 8-task benchmark comprises the following tasks: Cars (Krause et al., 2013), DTD (Cimpoi et al., 2014), EuroSAT (Helber et al., 2019), GTSRB (Stallkamp et al., 2011), MNIST (LeCun, 1998), RESISC45 (Cheng et al., 2017), SUN397 (Xiao et al., 2016), and SVHN (Netzer et al., 2011). |
| Dataset Splits | Yes | We consider the 8-task image classification benchmark studied in Ilharco et al. (2023). The scalar coefficient λ is tuned using a held-out validation set. We search for the hyper-parameter on the validation set and report performance on the test set using the best hyper-parameter selected by validation performance. |
| Hardware Specification | No | The paper refers to models like CLIP ViT-B/32, ViT-L/14, ConvNeXt, and T5-large but does not specify the hardware (e.g., GPU models, CPU types, memory amounts) used for running experiments. |
| Software Dependencies | No | The paper mentions 'PyTorch pseudo-code in Appendix A.' but does not specify version numbers for PyTorch or any other software dependencies, libraries, or programming languages used. |
| Experiment Setup | Yes | We apply LiNeS to each of the 70 fine-tuned checkpoints, setting α = β = 0.5. For the linear scaling schedule, we tune only β and set α using a heuristic that adjusts based on both the number of merged models and the merging method. We fine-tune the model using LoRA (Hu et al., 2022) with r_LoRA = 64, α_LoRA = 128, and 0.05 dropout. We list the hyper-parameter search space for each model merging method in Table B.2.1. |
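The linear scaling schedule referenced in the Experiment Setup row (the α and β hyper-parameters) can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' released PyTorch code: it assumes the per-block coefficient λ_ℓ = α + β · ℓ / (L − 1) described in the paper, and uses lists of floats in place of parameter tensors grouped by network block.

```python
def lines_coefficients(num_blocks, alpha=0.5, beta=0.5):
    """Linear per-block scaling: block l of L gets alpha + beta * l / (L - 1).

    Shallow blocks (small l) receive the smallest coefficient, preserving the
    pretrained model's general features, while the deepest, most task-specific
    block is scaled by alpha + beta.
    """
    if num_blocks == 1:
        return [alpha + beta]
    return [alpha + beta * l / (num_blocks - 1) for l in range(num_blocks)]


def apply_lines(pretrained, task_vector, alpha=0.5, beta=0.5):
    """Edit pretrained weights with a depth-scaled task vector.

    Both arguments are lists of per-block parameter lists (plain floats here
    for illustration; a real implementation would operate on torch tensors,
    one group per transformer block).
    """
    coeffs = lines_coefficients(len(task_vector), alpha, beta)
    return [
        [p + c * d for p, d in zip(p_block, t_block)]
        for p_block, t_block, c in zip(pretrained, task_vector, coeffs)
    ]
```

With α = β = 0.5 (as in the checkpoint experiments above) and three blocks, the coefficients are [0.5, 0.75, 1.0], so the first block keeps only half of its task-vector update while the last block keeps all of it.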