Modeling Multi-Task Model Merging as Adaptive Projective Gradient Descent
Authors: Yongxian Wei, Anke Tang, Li Shen, Zixuan Hu, Chun Yuan, Xiaochun Cao
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that our plug-and-play approach consistently outperforms previous methods, achieving state-of-the-art results across diverse architectures and tasks in both vision and NLP domains. Our code is available here. We conduct experiments on diverse vision and NLP tasks, including classification and generation, using various fully fine-tuned and LoRA fine-tuned architectures. Our plug-and-play approach achieves up to 11.6% gains over TA and 5.8% over AdaMerging. Simple task-aware λ provides a 2.8% performance boost. Furthermore, experiments on unseen tasks and out-of-distribution test sets demonstrate its generalization and robustness. Extensive ablation studies clarify the mechanisms of each component. |
| Researcher Affiliation | Academia | ¹Tsinghua University ²Wuhan University ³Shenzhen Campus of Sun Yat-sen University ⁴Nanyang Technological University. Correspondence to: Li Shen <EMAIL>, Chun Yuan <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Adaptive Projective Gradient Descent |
| Open Source Code | Yes | Experiments show that our plug-and-play approach consistently outperforms previous methods, achieving state-of-the-art results across diverse architectures and tasks in both vision and NLP domains. Our code is available here. |
| Open Datasets | Yes | For vision tasks, we use the ViT-B/32 and ViT-L/14 models, originally derived from CLIP (Radford et al., 2021). The downstream tasks encompass a variety of challenges, including SUN397 (Xiao et al., 2016), Stanford Cars (Krause et al., 2013), RESISC45 (Cheng et al., 2017), EuroSAT (Helber et al., 2019), SVHN (Netzer et al., 2011), GTSRB (Stallkamp et al., 2011), MNIST (LeCun, 1998), and DTD (Cimpoi et al., 2014). For NLP tasks, we use the Flan-T5-base and Flan-T5-large models (Chung et al., 2024), evaluated on eight tasks from the GLUE benchmark (Wang et al., 2019). |
| Dataset Splits | Yes | For vision tasks, we use the ViT-B/32 and ViT-L/14 models, originally derived from CLIP (Radford et al., 2021). The downstream tasks encompass a variety of challenges, including SUN397 (Xiao et al., 2016), Stanford Cars (Krause et al., 2013), RESISC45 (Cheng et al., 2017), EuroSAT (Helber et al., 2019), SVHN (Netzer et al., 2011), GTSRB (Stallkamp et al., 2011), MNIST (LeCun, 1998), and DTD (Cimpoi et al., 2014). For NLP tasks, we use the Flan-T5-base and Flan-T5-large models (Chung et al., 2024), evaluated on eight tasks from the GLUE benchmark (Wang et al., 2019). |
| Hardware Specification | Yes | The experiments in our study were conducted on a consistent hardware setup, utilizing NVIDIA GTX 4090 GPUs equipped with 24GB of memory. |
| Software Dependencies | Yes | For the implementation of our experiments, we employed PyTorch version 2.5 with Python 3.10. |
| Experiment Setup | Yes | We perform 400 iterations of learning with a learning rate of 1e-4. The global magnitude of the merging coefficient η is set to 0.07 for vision tasks and 0.15 for NLP tasks. The subspace basis size k is simply defined as the rank of each task vector divided by the number of tasks (i.e., 8). Following Ties-Merging (Yadav et al., 2023), we retain only the top 30% of parameters with the largest magnitudes. We only apply our method to the linear layer in the model. For vision tasks, we employ pre-trained models from CLIP (Radford et al., 2021), fine-tuning them using the AdamW optimizer with a weight decay of 0.1 and a learning rate of 1 × 10⁻⁵. We maintain a constant learning rate of 4 × 10⁻⁵ and a uniform batch size of 16 across all tasks, fine-tuning for 2000 steps per task. |
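To make the quoted setup concrete, the following is a rough NumPy sketch of task-vector merging with magnitude trimming and a per-task subspace projection, in the spirit of the details above (top-30% trimming, k = rank / number of tasks, merging coefficient η). This is an illustrative assumption, not the authors' Algorithm 1; the function names `trim_top_magnitude`, `project_to_subspace`, and `merge` are hypothetical.

```python
import numpy as np

def trim_top_magnitude(delta, keep=0.3):
    """Zero all but the top-`keep` fraction of entries by magnitude
    (the Ties-Merging-style trimming quoted above)."""
    flat = np.abs(delta).ravel()
    k = max(1, int(keep * flat.size))
    thresh = np.partition(flat, -k)[-k]
    return np.where(np.abs(delta) >= thresh, delta, 0.0)

def project_to_subspace(delta, k):
    """Project a task-vector matrix onto its top-k singular subspace."""
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    return (u[:, :k] * s[:k]) @ vt[:k, :]

def merge(pretrained, finetuned_weights, eta=0.07, keep=0.3):
    """Merge fine-tuned linear-layer weights into the pretrained weight:
    delta -> trim -> project -> accumulate scaled by eta."""
    n_tasks = len(finetuned_weights)
    merged = pretrained.copy()
    for w in finetuned_weights:
        delta = trim_top_magnitude(w - pretrained, keep)
        k = max(1, min(delta.shape) // n_tasks)  # rank / number of tasks, as quoted
        merged += eta * project_to_subspace(delta, k)
    return merged
```

Under this sketch, η = 0.07 matches the quoted vision-task setting and 0.15 the NLP setting; applying it only to linear layers mirrors the quoted restriction.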