Model merging with SVD to tie the KnOTS

Authors: George Stoica, Pratik Ramesh, Boglarka Ecsedi, Leshem Choshen, Judy Hoffman

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate KnOTS across diverse benchmarks spanning both vision and language domains. We first evaluate KnOTS on the popular per-task setting across both vision and language tasks (§5.2). Second, we study the capabilities of merging methods building general models by introducing a new benchmark (§5.3). Third, we conduct extensive analysis on different facets of KnOTS (§5.4).
Researcher Affiliation | Collaboration | 1 Georgia Tech, 2 IBM Research, MIT. Correspondence emails: EMAIL
Pseudocode | No | The paper describes the method in prose and provides an illustration in Figure 1, but does not contain a clearly labeled pseudocode or algorithm block.
Open Source Code | Yes | We release our code at: https://github.com/gstoica27/KnOTS.
Open Datasets | Yes | Merging eight ViT-B/32 models finetuned on image classification datasets. We follow the image classification benchmark from Ilharco et al. (2023) and merge models finetuned on eight different datasets: Cars (Krause et al., 2013), DTD (Cimpoi et al., 2014), EuroSAT (Helber et al., 2019), GTSRB (Stallkamp et al., 2011), MNIST (LeCun, 1998), RESISC45 (Cheng et al., 2017), SUN397 (Xiao et al., 2016) and SVHN (Netzer et al., 2011). ... We also evaluate KnOTS in the NLI setting, by merging six PEFT llama3-8B (AI, 2024) models finetuned on SNLI (Bowman et al., 2015), MNLI (Williams et al., 2018), SICK (Marelli et al., 2014), QNLI, RTE (Wang et al., 2019), and SCITAIL (Khot et al., 2018).
Dataset Splits | Yes | Specifically, this heldout set consists of the validation data of the respective dataset when it exists and otherwise randomly samples 20% of the test set. Note that in situations where we sample 20% of a dataset's test split, we always evaluate any merged model on the remaining 80% of examples.
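The held-out-split rule quoted above can be sketched in a few lines. This is an illustrative reconstruction, not code from the paper's repository; the function name `make_heldout_split` and the list-based data handling are assumptions.

```python
import random

def make_heldout_split(val_data, test_data, seed=0):
    """Return (heldout_set, eval_set) per the paper's rule:
    use the validation split when it exists; otherwise sample 20% of
    the test split as the heldout set and evaluate the merged model
    on the remaining 80%. (Illustrative sketch, not the authors' code.)
    """
    if val_data is not None:
        # Validation data exists: tune on it, evaluate on the full test split.
        return list(val_data), list(test_data)
    # No validation split: shuffle test indices deterministically and cut at 20%.
    idx = list(range(len(test_data)))
    random.Random(seed).shuffle(idx)
    cut = int(0.2 * len(test_data))
    heldout = [test_data[i] for i in idx[:cut]]
    eval_set = [test_data[i] for i in idx[cut:]]
    return heldout, eval_set
```

For example, with a 100-example test split and no validation data, this yields a 20-example heldout set and an 80-example evaluation set with no overlap.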
Hardware Specification | Yes | All of our experiments were conducted on machines with one Nvidia A40 with 48GB of VRAM, and a CPU that has 8 workers.
Software Dependencies | No | The paper mentions software components like AdamW (Loshchilov & Hutter, 2019) and PyTorch (Paszke et al., 2019) with citations to their respective papers, but it does not specify exact version numbers for these software packages or libraries.
Experiment Setup | Yes | We set the LoRA rank to be 16, LoRA alpha to be 16, LoRA dropout to be 0.1 and disable the use of bias parameters. All models are trained using the AdamW (Loshchilov & Hutter, 2019) optimizer, with a cosine learning rate scheduler (Loshchilov & Hutter, 2017) using Cross-Entropy loss. The ViT-B/32 models were fine-tuned on the 8 vision tasks using a standard learning rate of 1e-5, weight decay of 1e-1 and label smoothing set to 0. The ViT-L/14 models were fine-tuned on the 8 vision tasks using a standard learning rate of 3e-4, weight decay of 1e-4 and label smoothing set to 0. ... The models were trained using AdamW (Loshchilov & Hutter, 2019) optimizer using a linear learning rate scheduler, with a learning rate of 3e-5 and warmup steps set to 6% of the total number of training steps.
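The quoted hyperparameters can be collected into a small sketch: a LoRA config dict (keyword names mimic the Hugging Face `peft.LoraConfig` API, which is an assumption, since the paper does not name the library version) plus pure-Python versions of the two schedules it cites, a cosine schedule for the ViT runs and a linear schedule with 6% warmup for the llama3-8B NLI runs.

```python
import math

# LoRA settings quoted from the paper for the llama3-8B NLI models.
# Keyword names follow the style of peft.LoraConfig (an assumption).
LORA_CFG = dict(r=16, lora_alpha=16, lora_dropout=0.1, bias="none")

def cosine_lr(step, total_steps, base_lr=1e-5):
    """Cosine learning-rate schedule without restarts (Loshchilov & Hutter, 2017),
    as used for the ViT fine-tuning runs (base_lr=1e-5 for ViT-B/32)."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))

def linear_warmup_lr(step, total_steps, base_lr=3e-5, warmup_frac=0.06):
    """Linear schedule with warmup over the first 6% of steps,
    as described for the llama3-8B NLI fine-tuning runs."""
    warmup = max(1, int(warmup_frac * total_steps))
    if step < warmup:
        return base_lr * step / warmup          # linear warmup to base_lr
    return base_lr * (total_steps - step) / max(1, total_steps - warmup)  # linear decay to 0
```

Both schedules start and end where the paper implies: the cosine schedule decays from `base_lr` to 0 over training, and the linear schedule ramps up to `base_lr` over the first 6% of steps before decaying to 0.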