Model merging with SVD to tie the KnOTS

Authors: George Stoica, Pratik Ramesh, Boglarka Ecsedi, Leshem Choshen, Judy Hoffman

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate KnOTS across diverse benchmarks spanning both vision and language domains. We first evaluate KnOTS on the popular per-task setting across both vision and language tasks (§5.2). Second, we study the capabilities of merging methods building general models by introducing a new benchmark (§5.3). Third, we conduct extensive analysis on different facets of KnOTS (§5.4).
Researcher Affiliation | Collaboration | 1 Georgia Tech, 2 IBM Research, MIT. Correspondence emails: EMAIL
Pseudocode | No | The paper describes the method in prose and provides an illustration in Figure 1, but does not contain a clearly labeled pseudocode or algorithm block.
Open Source Code | Yes | We release our code at: https://github.com/gstoica27/KnOTS.
Open Datasets | Yes | Merging eight ViT-B/32 models finetuned on image classification datasets. We follow the image classification benchmark from Ilharco et al. (2023) and merge models finetuned on eight different datasets: Cars (Krause et al., 2013), DTD (Cimpoi et al., 2014), EuroSAT (Helber et al., 2019), GTSRB (Stallkamp et al., 2011), MNIST (LeCun, 1998), RESISC45 (Cheng et al., 2017), SUN397 (Xiao et al., 2016) and SVHN (Netzer et al., 2011). ... We also evaluate KnOTS in the NLI setting, by merging six PEFT llama3-8B (AI, 2024) models finetuned on SNLI (Bowman et al., 2015), MNLI (Williams et al., 2018), SICK (Marelli et al., 2014), QNLI, RTE (Wang et al., 2019), and SCITAIL (Khot et al., 2018).
Dataset Splits | Yes | Specifically, this heldout set consists of the validation data of the respective dataset when it exists and otherwise randomly samples 20% of the test set. Note that in situations where we sample 20% of a dataset's test split, we always evaluate any merged model on the remaining 80% of examples.
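The held-out-split rule quoted above can be sketched in a few lines. This is an illustrative reconstruction, not code from the paper's repository; the function name `make_heldout_split` and the list-based data handling are assumptions.

```python
import random

def make_heldout_split(val_data, test_data, seed=0):
    """Return (heldout_set, eval_set) per the paper's rule:
    use the validation split when it exists; otherwise sample 20% of
    the test split as the heldout set and evaluate the merged model
    on the remaining 80%. (Illustrative sketch, not the authors' code.)
    """
    if val_data is not None:
        # Validation data exists: tune on it, evaluate on the full test split.
        return list(val_data), list(test_data)
    # No validation split: shuffle test indices deterministically and cut at 20%.
    idx = list(range(len(test_data)))
    random.Random(seed).shuffle(idx)
    cut = int(0.2 * len(test_data))
    heldout = [test_data[i] for i in idx[:cut]]
    eval_set = [test_data[i] for i in idx[cut:]]
    return heldout, eval_set
```

For example, with a 100-example test split and no validation data, this yields a 20-example heldout set and an 80-example evaluation set with no overlap.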
Hardware Specification | Yes | All of our experiments were conducted on machines with one Nvidia A40 with 48GB of VRAM, and a CPU that has 8 workers.
Software Dependencies | No | The paper mentions software components like AdamW (Loshchilov & Hutter, 2019) and PyTorch (Paszke et al., 2019) with citations to their respective papers, but it does not specify exact version numbers for these software packages or libraries.
Experiment Setup | Yes | We set the LoRA rank to be 16, LoRA alpha to be 16, LoRA dropout to be 0.1 and disable the use of bias parameters. All models are trained using the AdamW (Loshchilov & Hutter, 2019) optimizer, with a cosine learning rate scheduler (Loshchilov & Hutter, 2017) using Cross-Entropy loss. The ViT-B/32 models were fine-tuned on the 8 vision tasks using a standard learning rate of 1e-5, weight decay of 1e-1 and label smoothing set to 0. The ViT-L/14 models were fine-tuned on the 8 vision tasks using a standard learning rate of 3e-4, weight decay of 1e-4 and label smoothing set to 0. ... The models were trained using AdamW (Loshchilov & Hutter, 2019) optimizer using a linear learning rate scheduler, with a learning rate of 3e-5 and warmup steps set to 6% of the total number of training steps.
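The quoted hyperparameters can be collected into a small sketch: a LoRA config dict (keyword names mimic the Hugging Face `peft.LoraConfig` API, which is an assumption, since the paper does not name the library version) plus pure-Python versions of the two schedules it cites, a cosine schedule for the ViT runs and a linear schedule with 6% warmup for the llama3-8B NLI runs.

```python
import math

# LoRA settings quoted from the paper for the llama3-8B NLI models.
# Keyword names follow the style of peft.LoraConfig (an assumption).
LORA_CFG = dict(r=16, lora_alpha=16, lora_dropout=0.1, bias="none")

def cosine_lr(step, total_steps, base_lr=1e-5):
    """Cosine learning-rate schedule without restarts (Loshchilov & Hutter, 2017),
    as used for the ViT fine-tuning runs (base_lr=1e-5 for ViT-B/32)."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))

def linear_warmup_lr(step, total_steps, base_lr=3e-5, warmup_frac=0.06):
    """Linear schedule with warmup over the first 6% of steps,
    as described for the llama3-8B NLI fine-tuning runs."""
    warmup = max(1, int(warmup_frac * total_steps))
    if step < warmup:
        return base_lr * step / warmup          # linear warmup to base_lr
    return base_lr * (total_steps - step) / max(1, total_steps - warmup)  # linear decay to 0
```

Both schedules start and end where the paper implies: the cosine schedule decays from `base_lr` to 0 over training, and the linear schedule ramps up to `base_lr` over the first 6% of steps before decaying to 0.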