CALM: Consensus-Aware Localized Merging for Multi-Task Learning
Authors: Kunda Yan, Min Zhang, Sen Cui, Qu Zikun, Bo Jiang, Feng Liu, Changshui Zhang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate the superiority and robustness of our CALM, significantly outperforming existing methods and achieving performance close to traditional MTL. We conduct extensive experiments on various datasets and demonstrate that CALM exhibits superior performance in model merging, surpassing other state-of-the-art global-aware and localized-aware methods. |
| Researcher Affiliation | Academia | 1Institute for Artificial Intelligence, Tsinghua University (THUAI), Beijing National Research Center for Information Science and Technology (BNRist), Department of Automation, Tsinghua University, Beijing, P.R.China 2East China Normal University (ECNU) 3The Chinese University of Hong Kong, Shenzhen 4School of Computing and Information Systems, The University of Melbourne. |
| Pseudocode | Yes | Algorithm 1 CALM. Input: number of tasks $T$, pretrained model $\theta_{pre}$, finetuned models $\{\theta_{ft}^t\}_{t=1}^{T}$, task-specific unsupervised datasets $\{D_t(X)\}_{t=1}^{T}$, efficient merging coefficient $\lambda$, regularization parameter $\alpha$. |
| Open Source Code | Yes | Code is available at https://github.com/yankd22/CALM. |
| Open Datasets | Yes | The eight visual classification datasets include SUN397 (Xiao et al., 2016), Cars (Krause et al., 2013), RESISC45 (Cheng et al., 2017), EuroSAT (Helber et al., 2019), SVHN (Netzer et al., 2011), GTSRB (Stallkamp et al., 2011), MNIST (LeCun, 1998), and DTD (Cimpoi et al., 2014). The natural language tasks consist of 12 GLUE tasks (Wang, 2018), including six single-sentence tasks: SST-2 (Socher et al., 2013), CR (Hu & Liu, 2004), MR (Pang & Lee, 2005), MPQA (Wiebe et al., 2005), TREC (Voorhees et al., 1999) and SUBJ (Pang & Lee, 2004), and six pairwise-sentence tasks: QNLI (Wang, 2018), SNLI (Bowman et al., 2015), MNLI (Williams et al., 2017), RTE (Wang, 2018), MRPC (Dagan et al., 2005) and QQP (Iyer et al., 2017). |
| Dataset Splits | No | The paper lists various datasets, some of which are well-known to have standard splits. However, it does not explicitly state the training, validation, or test split percentages or exact counts used for the fine-tuned models in its own text. It mentions using 'unsupervised training samples for optimization' and 'validation set' for specific steps, but not the primary dataset splits for model training. |
| Hardware Specification | Yes | Part of the experiments is conducted on a local server with an Ubuntu 16.04 system. It has two physical CPU chips, Intel(R) Xeon(R) Gold 6248 CPUs @ 2.50GHz with 20 CPU cores. The other experiments are conducted on a remote server with 8 GeForce RTX 3090 GPUs. |
| Software Dependencies | No | The paper mentions an 'Ubuntu 16.04 system' but does not specify any programming languages, libraries, or frameworks with their version numbers (e.g., Python, PyTorch, TensorFlow, CUDA versions) that would be necessary to replicate the experiments. |
| Experiment Setup | Yes | The merging process randomly selects two tasks for sequential merging, while others apply task arithmetic (Ilharco et al., 2023) with a coefficient of 0.3. For the visual tasks, 90% of the credible samples are used, and for the NLP tasks, 80% are used. Each task is optimized for 100 iterations, with the regularization parameter λ = 1. We initialize a real-valued mask R of the same size as the model and set a 1e-5 fraction of the parameter points to be active. The mask R is then iteratively trained with the credible sample set, which contains pseudo-labels, using a batch size of 128 and a learning rate of 1e7, a deliberately large learning rate to ensure effective information feedback to the mask. For each iteration, only two batches of the reliable sample set per task are used, with a total of 100 iterations. |
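The task-arithmetic baseline referenced in the Experiment Setup row (coefficient 0.3) can be sketched in plain Python. This is an illustrative sketch of the rule from Ilharco et al. (2023), not the authors' implementation; the function name and the flat `{param_name: value}` model representation are assumptions.

```python
def task_arithmetic_merge(pretrained, finetuned_models, coeff=0.3):
    """Task arithmetic (Ilharco et al., 2023):
    theta_merged = theta_pre + coeff * sum_t (theta_ft^t - theta_pre).
    Models are represented as {param_name: float} dicts for illustration;
    a real implementation would operate on tensors per parameter.
    """
    merged = dict(pretrained)
    for name, pre_val in pretrained.items():
        # Sum of task vectors (finetuned minus pretrained) for this parameter.
        delta = sum(ft[name] - pre_val for ft in finetuned_models)
        merged[name] = pre_val + coeff * delta
    return merged

# Toy example with scalar "parameters": opposing task vectors cancel.
pre = {"w": 1.0}
fts = [{"w": 2.0}, {"w": 0.0}]
print(task_arithmetic_merge(pre, fts))  # {'w': 1.0}
```

With the paper's coefficient of 0.3, each task vector contributes only 30% of its full shift, which is what keeps the merged model near the pretrained initialization.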
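The mask initialization described in the Experiment Setup row (a real-valued mask R the size of the model, with a 1e-5 fraction of entries active) might look roughly like the sketch below. The function name, the uniform-random choice of active positions, and the flat-list representation are assumptions for illustration only.

```python
import random

def init_mask(num_params, active_fraction=1e-5, seed=0):
    """Create a real-valued mask R over num_params entries where only
    a small fraction start active (1.0) and the rest start at 0.0.
    Represented as a flat list; a real implementation would use a
    tensor shaped like the model's parameters.
    """
    rng = random.Random(seed)
    n_active = max(1, int(num_params * active_fraction))
    mask = [0.0] * num_params
    for idx in rng.sample(range(num_params), n_active):
        mask[idx] = 1.0
    return mask

# For a million parameters, a 1e-5 fraction leaves 10 active entries.
mask = init_mask(1_000_000)
print(int(sum(mask)))  # 10
```

The quoted setup then trains this mask with pseudo-labeled credible samples (batch size 128, two batches per task per iteration, 100 iterations); only the initialization is sketched here.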