CALM: Consensus-Aware Localized Merging for Multi-Task Learning
Authors: Kunda Yan, Min Zhang, Sen Cui, Qu Zikun, Bo Jiang, Feng Liu, Changshui Zhang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate the superiority and robustness of our CALM, significantly outperforming existing methods and achieving performance close to traditional MTL. We conduct extensive experiments on various datasets and demonstrate that CALM exhibits superior performance in model merging, surpassing other state-of-the-art global-aware and localized-aware methods. |
| Researcher Affiliation | Academia | 1Institute for Artificial Intelligence, Tsinghua University (THUAI), Beijing National Research Center for Information Science and Technology (BNRist), Department of Automation, Tsinghua University, Beijing, P.R.China 2East China Normal University (ECNU) 3The Chinese University of Hong Kong, Shenzhen 4School of Computing and Information Systems, The University of Melbourne. |
| Pseudocode | Yes | Algorithm 1 CALM. Input: number of tasks $T$, pretrained model $\theta_{pre}$, finetuned models $\{\theta_{ft}^t\}_{t=1}^{T}$, task-specific unsupervised datasets $\{D_t(X)\}_{t=1}^{T}$, efficient merging coefficient $\lambda$, regularization parameter $\alpha$. |
| Open Source Code | Yes | Code is available at https://github.com/yankd22/CALM. |
| Open Datasets | Yes | The eight visual classification datasets include SUN397 (Xiao et al., 2016), Cars (Krause et al., 2013), RESISC45 (Cheng et al., 2017), EuroSAT (Helber et al., 2019), SVHN (Netzer et al., 2011), GTSRB (Stallkamp et al., 2011), MNIST (LeCun, 1998), and DTD (Cimpoi et al., 2014). The natural language tasks consist of 12 GLUE tasks (Wang, 2018), including six single-sentence tasks: SST-2 (Socher et al., 2013), CR (Hu & Liu, 2004), MR (Pang & Lee, 2005), MPQA (Wiebe et al., 2005), TREC (Voorhees et al., 1999) and SUBJ (Pang & Lee, 2004), and six pairwise-sentence tasks: QNLI (Wang, 2018), SNLI (Bowman et al., 2015), MNLI (Williams et al., 2017), RTE (Wang, 2018), MRPC (Dagan et al., 2005) and QQP (Iyer et al., 2017). |
| Dataset Splits | No | The paper lists various datasets, some of which are well-known to have standard splits. However, it does not explicitly state the training, validation, or test split percentages or exact counts used for the fine-tuned models in its own text. It mentions using 'unsupervised training samples for optimization' and 'validation set' for specific steps, but not the primary dataset splits for model training. |
| Hardware Specification | Yes | Part of the experiments is conducted on a local server with an Ubuntu 16.04 system. It has two physical CPU chips, Intel(R) Xeon(R) Gold 6248 CPUs @ 2.50GHz with 20 CPU cores. The other experiments are conducted on a remote server with 8 GeForce RTX 3090 GPUs. |
| Software Dependencies | No | The paper mentions an 'Ubuntu 16.04 system' but does not specify any programming languages, libraries, or frameworks with their version numbers (e.g., Python, PyTorch, TensorFlow, CUDA versions) that would be necessary to replicate the experiments. |
| Experiment Setup | Yes | The merging process randomly selects two tasks for sequential merging, while others apply task arithmetic (Ilharco et al., 2023) with a coefficient of 0.3. For the visual tasks, 90% of the credible samples are used, and for the NLP tasks, 80% are used. Each task is optimized for 100 iterations, with the regularization parameter λ = 1. We initialize a real-valued mask R of the same size as the model and set a 1e-5 fraction of the parameter points to be active. The mask R is then iteratively trained with the credible sample set, which contains pseudo-labels, using a batch size of 128 and a learning rate of 1e7, a deliberately large learning rate to ensure effective information feedback to the mask. For each iteration, only two batches of the reliable sample set per task are used, with a total of 100 iterations. |
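The task-arithmetic baseline referenced in the Experiment Setup row (coefficient 0.3) can be sketched in plain Python. This is an illustrative sketch of the rule from Ilharco et al. (2023), not the authors' implementation; the function name and the flat `{param_name: value}` model representation are assumptions.

```python
def task_arithmetic_merge(pretrained, finetuned_models, coeff=0.3):
    """Task arithmetic (Ilharco et al., 2023):
    theta_merged = theta_pre + coeff * sum_t (theta_ft^t - theta_pre).
    Models are represented as {param_name: float} dicts for illustration;
    a real implementation would operate on tensors per parameter.
    """
    merged = dict(pretrained)
    for name, pre_val in pretrained.items():
        # Sum of task vectors (finetuned minus pretrained) for this parameter.
        delta = sum(ft[name] - pre_val for ft in finetuned_models)
        merged[name] = pre_val + coeff * delta
    return merged

# Toy example with scalar "parameters": opposing task vectors cancel.
pre = {"w": 1.0}
fts = [{"w": 2.0}, {"w": 0.0}]
print(task_arithmetic_merge(pre, fts))  # {'w': 1.0}
```

With the paper's coefficient of 0.3, each task vector contributes only 30% of its full shift, which is what keeps the merged model near the pretrained initialization.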
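The mask initialization described in the Experiment Setup row (a real-valued mask R the size of the model, with a 1e-5 fraction of entries active) might look roughly like the sketch below. The function name, the uniform-random choice of active positions, and the flat-list representation are assumptions for illustration only.

```python
import random

def init_mask(num_params, active_fraction=1e-5, seed=0):
    """Create a real-valued mask R over num_params entries where only
    a small fraction start active (1.0) and the rest start at 0.0.
    Represented as a flat list; a real implementation would use a
    tensor shaped like the model's parameters.
    """
    rng = random.Random(seed)
    n_active = max(1, int(num_params * active_fraction))
    mask = [0.0] * num_params
    for idx in rng.sample(range(num_params), n_active):
        mask[idx] = 1.0
    return mask

# For a million parameters, a 1e-5 fraction leaves 10 active entries.
mask = init_mask(1_000_000)
print(int(sum(mask)))  # 10
```

The quoted setup then trains this mask with pseudo-labeled credible samples (batch size 128, two batches per task per iteration, 100 iterations); only the initialization is sketched here.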