CAT Merging: A Training-Free Approach for Resolving Conflicts in Model Merging

Authors: Wenju Sun, Qingyong Li, Yangliao Geng, Boyang Li

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments on vision, language, and vision-language tasks demonstrate that CAT Merging effectively suppresses knowledge conflicts, achieving average accuracy improvements of up to 2.5% (ViT-B/32) and 2.0% (ViT-L/14) over state-of-the-art methods."
Researcher Affiliation | Academia | "1 Key Laboratory of Big Data & Artificial Intelligence in Transportation (Ministry of Education), School of Computer Science and Technology, Beijing Jiaotong University, Beijing, China; 2 College of Computing and Data Science, Nanyang Technological University, Singapore."
Pseudocode | Yes | "The detailed implementation of this process is described in Algorithm 1."
Open Source Code | No | The text is ambiguous or lacks a clear, affirmative statement of release.
Open Datasets | Yes | "We select diverse datasets to evaluate our work, including eight vision datasets: SUN397 (Xiao et al., 2016), Cars (Krause et al., 2013), RESISC45 (Cheng et al., 2017), EuroSAT (Helber et al., 2019), SVHN (Netzer et al., 2011), GTSRB (Stallkamp et al., 2011), MNIST (LeCun & Cortes, 2010), and DTD (Cimpoi et al., 2014); six vision-language datasets: COCO Caption (Chen et al., 2015), Flickr30k Caption (Plummer et al., 2015), TextCaps (Sidorov et al., 2020), OKVQA (Marino et al., 2019), TextVQA (Singh et al., 2019), and ScienceQA (Lu et al., 2022); and eight NLP tasks in the GLUE benchmark (Wang et al., 2019)."
Dataset Splits | Yes | For SVHN: "It includes 10 digit classes, with 73,257 training images, 26,032 test images, and an additional 531,131 samples for extended training." For MNIST: "A classic handwritten digit classification dataset containing 60,000 training images and 10,000 test images, evenly distributed across 10 digit classes." For NLP tasks: "Following their setting, we reserve 10% of the training set for validation and employ the original validation data as the test set."
Hardware Specification | Yes | "All experiments detailed in our manuscript and appendix were conducted on a workstation running Ubuntu 16.04, equipped with 2 Intel Xeon 2.60GHz CPUs, 256 GB of memory, and 6 NVIDIA RTX3090 GPUs."
Software Dependencies | Yes | "We leverage Python 3.8 to implement all the methods."
Experiment Setup | Yes | "For the visual-language tasks, we derive task vectors by fine-tuning the visual question-answering (VQA) version of BLIP (Li et al., 2022), training each task for 6,000 steps." "This section analyzes the sensitivity of two additional hyper-parameters λ and c." "In Figure 2 (b), with different values of α, CAT Merging achieves more stable performance than Task Arithmetic." "c affects the number of task vector components that are trimmed."
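The Experiment Setup row refers to task vectors and a scaling hyper-parameter α. As a point of reference for those terms, the sketch below shows the plain Task Arithmetic baseline that CAT Merging is compared against — not CAT Merging itself, whose conflict-trimming step is described in the paper's Algorithm 1. Weights are represented as flattened NumPy vectors for illustration only.

```python
import numpy as np

def task_vector(theta_pre, theta_ft):
    """Task vector: fine-tuned weights minus pretrained weights."""
    return theta_ft - theta_pre

def task_arithmetic_merge(theta_pre, task_vectors, alpha):
    """Task Arithmetic baseline: theta_pre + alpha * sum of task vectors.

    CAT Merging (per the paper) additionally trims conflicting components
    from each task vector before this sum; that step is omitted here.
    """
    return theta_pre + alpha * np.sum(task_vectors, axis=0)

# Toy example: a 4-parameter "model" fine-tuned on two tasks whose task
# vectors conflict in the third coordinate (+2 vs. -2).
theta_pre = np.zeros(4)
theta_a = np.array([1.0, 0.0, 2.0, 0.0])   # fine-tuned on task A
theta_b = np.array([0.0, 1.0, -2.0, 0.0])  # fine-tuned on task B

taus = [task_vector(theta_pre, theta_a), task_vector(theta_pre, theta_b)]
merged = task_arithmetic_merge(theta_pre, taus, alpha=0.5)
print(merged)  # merged weights: [0.5, 0.5, 0.0, 0.0] -- the conflict cancels
```

The third coordinate illustrates the knowledge-conflict problem the paper targets: opposing task-vector components cancel under plain addition, degrading both tasks, which is what motivates trimming conflicting components before merging.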
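The Dataset Splits row quotes the NLP protocol of holding out 10% of the training set for validation and using the original validation data as the test set. A minimal sketch of that holdout, with a synthetic stand-in for a GLUE training set (the size 1,000 is a placeholder, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
train_indices = np.arange(1000)   # stand-in for a GLUE task's training set
rng.shuffle(train_indices)

n_val = len(train_indices) // 10  # reserve 10% for validation
val_idx = train_indices[:n_val]
new_train_idx = train_indices[n_val:]

print(len(new_train_idx), len(val_idx))  # 900 100
# The original validation split would then serve as the test set.
```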