Modeling Multi-Task Model Merging as Adaptive Projective Gradient Descent
Authors: Yongxian Wei, Anke Tang, Li Shen, Zixuan Hu, Chun Yuan, Xiaochun Cao
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that our plug-and-play approach consistently outperforms previous methods, achieving state-of-the-art results across diverse architectures and tasks in both vision and NLP domains. Our code is available here. We conduct experiments on diverse vision and NLP tasks, including classification and generation, using various fully fine-tuned and LoRA fine-tuned architectures. Our plug-and-play approach achieves up to 11.6% gains over TA and 5.8% over AdaMerging. Simple task-aware λ provides a 2.8% performance boost. Furthermore, experiments on unseen tasks and out-of-distribution test sets demonstrate its generalization and robustness. Extensive ablation studies clarify the mechanisms of each component. |
| Researcher Affiliation | Academia | ¹Tsinghua University ²Wuhan University ³Shenzhen Campus of Sun Yat-sen University ⁴Nanyang Technological University. Correspondence to: Li Shen <EMAIL>, Chun Yuan <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Adaptive Projective Gradient Descent |
| Open Source Code | Yes | Experiments show that our plug-and-play approach consistently outperforms previous methods, achieving state-of-the-art results across diverse architectures and tasks in both vision and NLP domains. Our code is available here. |
| Open Datasets | Yes | For vision tasks, we use the ViT-B/32 and ViT-L/14 models, originally derived from CLIP (Radford et al., 2021). The downstream tasks encompass a variety of challenges, including SUN397 (Xiao et al., 2016), Stanford Cars (Krause et al., 2013), RESISC45 (Cheng et al., 2017), EuroSAT (Helber et al., 2019), SVHN (Netzer et al., 2011), GTSRB (Stallkamp et al., 2011), MNIST (LeCun, 1998), and DTD (Cimpoi et al., 2014). For NLP tasks, we use the Flan-T5-base and Flan-T5-large models (Chung et al., 2024), evaluated on eight tasks from the GLUE benchmark (Wang et al., 2019). |
| Dataset Splits | Yes | For vision tasks, we use the ViT-B/32 and ViT-L/14 models, originally derived from CLIP (Radford et al., 2021). The downstream tasks encompass a variety of challenges, including SUN397 (Xiao et al., 2016), Stanford Cars (Krause et al., 2013), RESISC45 (Cheng et al., 2017), EuroSAT (Helber et al., 2019), SVHN (Netzer et al., 2011), GTSRB (Stallkamp et al., 2011), MNIST (LeCun, 1998), and DTD (Cimpoi et al., 2014). For NLP tasks, we use the Flan-T5-base and Flan-T5-large models (Chung et al., 2024), evaluated on eight tasks from the GLUE benchmark (Wang et al., 2019). |
| Hardware Specification | Yes | The experiments in our study were conducted on a consistent hardware setup, utilizing NVIDIA GTX 4090 GPUs equipped with 24GB of memory. |
| Software Dependencies | Yes | For the implementation of our experiments, we employed PyTorch version 2.5 with Python 3.10. |
| Experiment Setup | Yes | We perform 400 iterations of learning with a learning rate of 1e-4. The global magnitude of the merging coefficient η is set to 0.07 for vision tasks and 0.15 for NLP tasks. The subspace basis size k is simply defined as the rank of each task vector divided by the number of tasks (i.e., 8). Following Ties-Merging (Yadav et al., 2023), we retain only the top 30% of parameters with the largest magnitudes. We only apply our method to the linear layer in the model. For vision tasks, we employ pre-trained models from CLIP (Radford et al., 2021), fine-tuning them using the AdamW optimizer with a weight decay of 0.1 and a learning rate of 1 × 10⁻⁵. We maintain a constant learning rate of 4 × 10⁻⁵ and a uniform batch size of 16 across all tasks, fine-tuning for 2000 steps per task. |
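To make the quoted setup concrete, the following is a rough NumPy sketch of task-vector merging with magnitude trimming and a per-task subspace projection, in the spirit of the details above (top-30% trimming, k = rank / number of tasks, merging coefficient η). This is an illustrative assumption, not the authors' Algorithm 1; the function names `trim_top_magnitude`, `project_to_subspace`, and `merge` are hypothetical.

```python
import numpy as np

def trim_top_magnitude(delta, keep=0.3):
    """Zero all but the top-`keep` fraction of entries by magnitude
    (the Ties-Merging-style trimming quoted above)."""
    flat = np.abs(delta).ravel()
    k = max(1, int(keep * flat.size))
    thresh = np.partition(flat, -k)[-k]
    return np.where(np.abs(delta) >= thresh, delta, 0.0)

def project_to_subspace(delta, k):
    """Project a task-vector matrix onto its top-k singular subspace."""
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    return (u[:, :k] * s[:k]) @ vt[:k, :]

def merge(pretrained, finetuned_weights, eta=0.07, keep=0.3):
    """Merge fine-tuned linear-layer weights into the pretrained weight:
    delta -> trim -> project -> accumulate scaled by eta."""
    n_tasks = len(finetuned_weights)
    merged = pretrained.copy()
    for w in finetuned_weights:
        delta = trim_top_magnitude(w - pretrained, keep)
        k = max(1, min(delta.shape) // n_tasks)  # rank / number of tasks, as quoted
        merged += eta * project_to_subspace(delta, k)
    return merged
```

Under this sketch, η = 0.07 matches the quoted vision-task setting and 0.15 the NLP setting; applying it only to linear layers mirrors the quoted restriction.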