Localize-and-Stitch: Efficient Model Merging via Sparse Task Arithmetic
Authors: Yifei He, Yuzheng Hu, Yong Lin, Tong Zhang, Han Zhao
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we evaluate our method on various vision and language benchmarks, showing that it outperforms existing model merging methods under different data availability scenarios. |
| Researcher Affiliation | Academia | Yifei He (University of Illinois Urbana-Champaign), Yuzheng Hu (University of Illinois Urbana-Champaign), Yong Lin (Princeton University), Tong Zhang (University of Illinois Urbana-Champaign), Han Zhao (University of Illinois Urbana-Champaign) |
| Pseudocode | Yes | Algorithm 1 (Localize-and-Stitch). Input: pretrained model θ_pre, finetuned models {θ_ft^(i)}_{i=1}^n, regularization coefficient λ, magnitude threshold k. Output: merged model θ_merged, binary masks {γ_i}_{i=1}^n. Step 1 (Localization): for i = 1, …, n, compute the task vector τ_i = θ_ft^(i) − θ_pre; if validation data is available, solve S_i = argmin_S ℓ_i(θ_pre + σ(S) ⊙ τ_i) + λ‖σ(S)‖_1 and binarize the mask as γ_i = round(σ(S_i)); otherwise (dataless localization), set γ_i[j] = 1 if \|τ_i[j]\| is among the top-k% magnitudes of \|τ_i\|, and 0 otherwise. Step 2 (Stitching): for each coordinate j = 1, …, d, average over overlaps: γ̂_i[j] = γ_i[j] / Σ_{i'=1}^n γ_{i'}[j]. Return θ_merged = θ_pre + Σ_{i=1}^n γ̂_i ⊙ τ_i. |
| Open Source Code | Yes | Our code is available at https://github.com/uiuctml/Localize-and-Stitch. |
| Open Datasets | Yes | We evaluate our method on various vision and language tasks, showing that it outperforms existing model merging methods under different data availability scenarios. The dataset suite includes six single-sentence tasks (SST-2 (Socher et al., 2013), CR (Hu & Liu, 2004), MR (Pang & Lee, 2005), MPQA (Wiebe et al., 2005), TREC (Voorhees et al., 1999), SUBJ (Pang & Lee, 2004)) and six pairwise-sentence tasks (QNLI (Wang et al., 2018), SNLI (Bowman et al., 2015), MNLI (Williams et al., 2017), RTE (Wang et al., 2018), MRPC (Dolan & Brockett, 2005), QQP (Iyer et al.)). |
| Dataset Splits | Yes | Our localization step is performed with 64-shot validation data, and the sparsity is chosen to be 1%. In the dataless version, the sparsity is chosen to be 5%. To assess these models, we use MMLU (Hendrycks et al., 2021), ARC (Clark et al., 2018) and TruthfulQA (Lin et al., 2021) as evaluation datasets for the respective domains. Unlike datasets in the previous section, these are typically used in their entirety for evaluation, without a designated train-test split. However, using these datasets for both evaluation and localization could lead to data leakage. To prevent this, we use data from three surrogate datasets with similar purposes for localization, namely Alpaca (Taori et al., 2023), GSM8K (Cobbe et al., 2021) and HotpotQA (Yang et al., 2018). |
| Hardware Specification | Yes | The experiments are run on NVIDIA RTX A6000 GPUs with 48GB memory. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies like programming languages or libraries (e.g., Python, PyTorch, TensorFlow). |
| Experiment Setup | Yes | For the experiments on RoBERTa-base, we perform the finetuning process following the same procedure as Panigrahi et al. (2023). Specifically, we use a batch size of 4 and a learning rate of 2e-5 to finetune on each of the language tasks for 10 epochs with the SGD optimizer. For the experiments on CLIP ViT, we directly use the finetuned checkpoints provided in Ilharco et al. (2023) with the data preprocessing step provided by Yang et al. (2023). Following the practice in Panigrahi et al. (2023), in the localization step, we initialize the trainable real-valued vector S as the mask for the top-k% largest entries in the task vector. Since the actual mask is rounded from σ(S) rather than S, we choose the initial values of S to be either 0 or 3, as σ(3) is sufficiently close to 1. To achieve a sparsity level of 1%, we use a learning rate of 1e7, a batch size of 16, and an L1 regularization factor λ = 1e-5, and perform the optimization for 10 epochs on 64-shot data from each task. Following common practice in Panigrahi et al. (2023); Yadav et al. (2023), we only perform localization in the transformer blocks, and do not consider embedding layers. |
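The dataless variant of Algorithm 1 can be sketched in PyTorch as follows. This is a toy illustration on flattened parameter vectors; `dataless_localize` and `stitch` are hypothetical helper names, not the authors' released code, and the top-k thresholding assumes no exact magnitude ties.

```python
# Sketch of Localize-and-Stitch (dataless variant, Algorithm 1):
# top-k% magnitude masks per task vector, then overlap-averaged stitching.
import torch

def dataless_localize(task_vector: torch.Tensor, sparsity: float = 0.05) -> torch.Tensor:
    """Binary mask keeping the top-k% largest-magnitude entries of a task vector."""
    k = max(1, int(sparsity * task_vector.numel()))
    # kth smallest of |tau| at position n-k+1 is the k-th largest magnitude
    threshold = task_vector.abs().flatten().kthvalue(task_vector.numel() - k + 1).values
    return (task_vector.abs() >= threshold).float()

def stitch(theta_pre, task_vectors, masks):
    """Merge: average masked task vectors where masks overlap, then add to theta_pre."""
    overlap = torch.stack(masks).sum(dim=0).clamp(min=1.0)  # avoid div-by-zero
    merged = theta_pre.clone()
    for tau, gamma in zip(task_vectors, masks):
        merged += (gamma / overlap) * tau  # gamma/overlap is the rescaled mask
    return merged

# toy usage with flattened "models"
theta_pre = torch.zeros(10)
finetuned = [theta_pre + torch.randn(10), theta_pre + torch.randn(10)]
taus = [ft - theta_pre for ft in finetuned]            # task vectors
masks = [dataless_localize(t, sparsity=0.3) for t in taus]
theta_merged = stitch(theta_pre, taus, masks)
```

Coordinates selected by no mask keep their pretrained values, which is the mechanism the paper credits for preserving pretrained knowledge in the merged model.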
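The validation-based localization described in the Experiment Setup row (trainable logits S initialized to 0 or 3, sigmoid gating, L1 sparsity penalty, rounding to a binary mask) can be sketched as below. All names (`localize_with_data`, `loss_fn`) are placeholders; the default `lr=1e7` mirrors the value reported in the setup, though a toy loss needs a much smaller rate.

```python
# Sketch of the validation-based localization step: train real-valued logits S
# so that sigmoid(S) gates the task vector, with an L1 penalty on the gate.
import torch

def localize_with_data(theta_pre, tau, loss_fn, lam=1e-5, steps=10,
                       lr=1e7, init_topk=0.01):
    # Initialize S to 3 on the top-k% magnitude entries and 0 elsewhere,
    # since sigmoid(3) ~ 0.95 is close to 1 and sigmoid(0) = 0.5.
    k = max(1, int(init_topk * tau.numel()))
    thresh = tau.abs().flatten().kthvalue(tau.numel() - k + 1).values
    S = torch.where(tau.abs() >= thresh, torch.tensor(3.0), torch.tensor(0.0))
    S.requires_grad_(True)
    opt = torch.optim.SGD([S], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        gate = torch.sigmoid(S)
        # task loss on the gated model plus L1 sparsity regularization
        loss = loss_fn(theta_pre + gate * tau) + lam * gate.abs().sum()
        loss.backward()
        opt.step()
    return torch.round(torch.sigmoid(S))  # binarize the learned mask

# toy usage: a quadratic surrogate loss, small lr for the toy scale
mask = localize_with_data(torch.zeros(10), torch.randn(10),
                          lambda w: (w ** 2).sum(), lam=1e-3, steps=5, lr=0.1)
```

In the paper's actual setup the loss is the task loss on 64-shot validation data; the quadratic surrogate here is only to make the sketch self-contained.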