CAT Merging: A Training-Free Approach for Resolving Conflicts in Model Merging

Authors: Wenju Sun, Qingyong Li, Yangliao Geng, Boyang Li

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments on vision, language, and vision-language tasks demonstrate that CAT Merging effectively suppresses knowledge conflicts, achieving average accuracy improvements of up to 2.5% (ViT-B/32) and 2.0% (ViT-L/14) over state-of-the-art methods."
Researcher Affiliation | Academia | "1 Key Laboratory of Big Data & Artificial Intelligence in Transportation (Ministry of Education), School of Computer Science and Technology, Beijing Jiaotong University, Beijing, China; 2 College of Computing and Data Science, Nanyang Technological University, Singapore."
Pseudocode | Yes | "The detailed implementation of this process is described in Algorithm 1."
Open Source Code | No | The text is ambiguous or lacks a clear, affirmative statement of release.
Open Datasets | Yes | "We select diverse datasets to evaluate our work, including eight vision datasets: SUN397 (Xiao et al., 2016), Cars (Krause et al., 2013), RESISC45 (Cheng et al., 2017), EuroSAT (Helber et al., 2019), SVHN (Netzer et al., 2011), GTSRB (Stallkamp et al., 2011), MNIST (LeCun & Cortes, 2010), and DTD (Cimpoi et al., 2014); six vision-language datasets: COCO Caption (Chen et al., 2015), Flickr30k Caption (Plummer et al., 2015), TextCaps (Sidorov et al., 2020), OKVQA (Marino et al., 2019), TextVQA (Singh et al., 2019), and ScienceQA (Lu et al., 2022); and eight NLP tasks in the GLUE benchmark (Wang et al., 2019)."
Dataset Splits | Yes | For SVHN: "It includes 10 digit classes, with 73,257 training images, 26,032 test images, and an additional 531,131 samples for extended training." For MNIST: "A classic handwritten digit classification dataset containing 60,000 training images and 10,000 test images, evenly distributed across 10 digit classes." For NLP tasks: "Following their setting, we reserve 10% of the training set for validation and employ the original validation data as the test set."
Hardware Specification | Yes | "All experiments detailed in our manuscript and appendix were conducted on a workstation running Ubuntu 16.04, equipped with 2 Intel Xeon 2.60GHz CPUs, 256 GB of memory, and 6 NVIDIA RTX3090 GPUs."
Software Dependencies | Yes | "We leverage Python 3.8 to implement all the methods."
Experiment Setup | Yes | "For the visual-language tasks, we derive task vectors by fine-tuning the visual question-answering (VQA) version of BLIP (Li et al., 2022), training each task for 6,000 steps." "This section analyzes the sensitivity of two additional hyper-parameters λ and c." "In Figure 2 (b), with different values of α, CAT Merging achieves more stable performance than Task Arithmetic." "c affects the number of task vector components that are trimmed."
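The Experiment Setup row refers to task vectors and a scaling hyper-parameter α. As a point of reference for those terms, the sketch below shows the plain Task Arithmetic baseline that CAT Merging is compared against — not CAT Merging itself, whose conflict-trimming step is described in the paper's Algorithm 1. Weights are represented as flattened NumPy vectors for illustration only.

```python
import numpy as np

def task_vector(theta_pre, theta_ft):
    """Task vector: fine-tuned weights minus pretrained weights."""
    return theta_ft - theta_pre

def task_arithmetic_merge(theta_pre, task_vectors, alpha):
    """Task Arithmetic baseline: theta_pre + alpha * sum of task vectors.

    CAT Merging (per the paper) additionally trims conflicting components
    from each task vector before this sum; that step is omitted here.
    """
    return theta_pre + alpha * np.sum(task_vectors, axis=0)

# Toy example: a 4-parameter "model" fine-tuned on two tasks whose task
# vectors conflict in the third coordinate (+2 vs. -2).
theta_pre = np.zeros(4)
theta_a = np.array([1.0, 0.0, 2.0, 0.0])   # fine-tuned on task A
theta_b = np.array([0.0, 1.0, -2.0, 0.0])  # fine-tuned on task B

taus = [task_vector(theta_pre, theta_a), task_vector(theta_pre, theta_b)]
merged = task_arithmetic_merge(theta_pre, taus, alpha=0.5)
print(merged)  # merged weights: [0.5, 0.5, 0.0, 0.0] -- the conflict cancels
```

The third coordinate illustrates the knowledge-conflict problem the paper targets: opposing task-vector components cancel under plain addition, degrading both tasks, which is what motivates trimming conflicting components before merging.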
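The Dataset Splits row quotes the NLP protocol of holding out 10% of the training set for validation and using the original validation data as the test set. A minimal sketch of that holdout, with a synthetic stand-in for a GLUE training set (the size 1,000 is a placeholder, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
train_indices = np.arange(1000)   # stand-in for a GLUE task's training set
rng.shuffle(train_indices)

n_val = len(train_indices) // 10  # reserve 10% for validation
val_idx = train_indices[:n_val]
new_train_idx = train_indices[n_val:]

print(len(new_train_idx), len(val_idx))  # 900 100
# The original validation split would then serve as the test set.
```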