Efficient Model Editing with Task-Localized Sparse Fine-tuning
Authors: Leonardo Iurada, Marco Ciccone, Tatiana Tommasi
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive empirical analyses and theoretical justifications demonstrate that our approach effectively promotes weight disentanglement, ensuring compatibility between task vectors without the need for sharing information between users. This enables efficient and robust model editing through the simple addition and subtraction of sparse task vectors, facilitating decentralized collaborative strategies. Our experimental evaluation focuses on the established Task Arithmetic framework outlined by Ilharco et al. (2022; 2023), specifically targeting Task Addition and Task Negation, encompassing both language and vision domains. In the following we describe the baselines we compared our TaLoS against. Further details regarding the experimental setups, the relevant metrics, the implementation of the experiments, as well as the data and architectures used, are deferred to Appendix A.1. |
| Researcher Affiliation | Academia | Leonardo Iurada¹, Marco Ciccone², Tatiana Tommasi¹ — ¹Politecnico di Torino, Italy; ²Vector Institute, Toronto, Ontario, Canada. Correspondence to: EMAIL |
| Pseudocode | Yes | Pseudocode for our algorithm is included in Appendix A.2 to clarify key steps, as well as practical design choices to address potential challenges in implementing our experiments. Additionally, we publicly released our code to further facilitate reproducibility at https://github.com/iurada/talos-task-arithmetic. Algorithm 1: TaLoS to obtain task vectors |
| Open Source Code | Yes | Code available at: https://github.com/iurada/talos-task-arithmetic |
| Open Datasets | Yes | In line with what was introduced in Ilharco et al. (2022; 2023); Ortiz-Jimenez et al. (2023), our vision experiments consider image classification across various domains. We adhere to the proposed experimental setup by utilizing eight datasets: Cars (Krause et al., 2013), DTD (Cimpoi et al., 2014), EuroSAT (Helber et al., 2019), GTSRB (Stallkamp et al., 2011), MNIST (LeCun, 1998), RESISC45 (Cheng et al., 2017), SUN397 (Xiao et al., 2016) and SVHN (Netzer et al., 2011). For the natural language processing (NLP) experiments, we follow the methodology outlined in Yadav et al. (2023), incorporating seven prescribed datasets: three regarding question answering (QASC (Khot et al., 2020), WikiQA (Yang et al., 2015) and QuaRTz (Tafjord et al., 2019)), one for paraphrase identification (PAWS (Zhang et al., 2019)), one focusing on sentence completion (Story Cloze (Sharma et al., 2018)) and two for coreference resolution (Winogrande (Sakaguchi et al., 2021) and WSC (Levesque et al., 2012)). Concerning Task Negation, we align with Ortiz-Jimenez et al. (2023) and consider ImageNet (Deng et al., 2009) as the control dataset for vision experiments, while for NLP we utilize RTE (Dagan et al., 2005), as it provides a distinct task (i.e. natural language inference) with respect to the others considered for the NLP experiments. |
| Dataset Splits | Yes | Regarding the amount of data used to perform mask calibration on each task, we align with Panda et al. (2024) by using the validation split, as it accounts for 10% of the total training data. For LoTA, we set the number of iterations for mask calibration so as to match the number of mask calibration rounds used by our method (further details in Section A.2). This ensures that the drop in performance is negligible with respect to using the full training split, while significantly reducing the computational overhead. [...] Estimations are carried out on a random subset of 2,048 test points for each dataset. [...] For each method, we cross-validate its hyperparameters on each individual task by leveraging Task Negation performance on a small held-out portion of the training set, as implemented by Ilharco et al. (2023); Ortiz-Jimenez et al. (2023). |
| Hardware Specification | Yes | We execute all the vision experiments using ViT-B/32, ViT-B/16, and ViT-L/14 on a machine equipped with two NVIDIA GeForce RTX 2080 Ti (11 GB VRAM), an Intel Core i7-9800X CPU @ 3.80GHz and 64 GB of RAM. For all the language experiments using T5-Small, T5-Base, and T5-Large we employ a machine equipped with a single NVIDIA A100 SXM (64 GB VRAM), an Intel Xeon Platinum 8358 CPU @ 2.60GHz and 64 GB of RAM. |
| Software Dependencies | No | The timings in Table 3 are obtained using the perf_counter clock from Python's time module. We monitored memory footprint using the NVIDIA NVML library. All measurements are obtained during fine-tuning, with the very same setup explained in the fine-tuning details. Then, for each method, the mean and standard deviation of the timings are computed over all iterations of all tasks. Peak memory usage, instead, is taken as the maximum over all tasks. Memory usage is recorded at regular intervals of 1 second, starting from the first forward pass and ending when the training loop breaks. Normalized accuracy calculation in Task Addition. Normalized accuracy is computed by taking the average of the normalized individual accuracies over the T tasks. Given a task t, the normalized individual accuracy for t is computed by taking the accuracy of the multi-task fused model on t and dividing it by the single-task accuracy that the fine-tuned checkpoint obtained on t before being fused. Formally, $\text{Normalized Accuracy} = \frac{1}{T}\sum_{t=1}^{T}\frac{\text{Accuracy}\left[f\left(\mathcal{D}_t,\, \theta_0 + \sum_{t'=1}^{T}\alpha_{t'}\tau_{t'}\right)\right]}{\text{Accuracy}\left[f\left(\mathcal{D}_t,\, \theta_0 + \alpha_t\tau_t\right)\right]}$ (11) |
| Experiment Setup | Yes | All fine-tuning experiments on vision adhere to the training protocol outlined by Ilharco et al. (2022; 2023); Ortiz-Jimenez et al. (2023), with minor modifications made to the training code to accommodate the additional baselines and our method. Specifically, we fine-tune all datasets starting from the same CLIP pre-trained checkpoint, which is obtained from the open_clip repository (Gadre et al., 2024). Each model is fine-tuned for 2,000 iterations with a batch size of 128, a learning rate of 1e-5, and a cosine annealing learning rate schedule with 200 warm-up steps. We use the AdamW optimizer (Loshchilov & Hutter, 2019). Following Ilharco et al. (2022), the weights of the classification layer, which are derived from encoding a standard set of zero-shot template prompts for each dataset, are frozen during fine-tuning. Freezing this layer ensures no additional learnable parameters are introduced and does not negatively affect accuracy (Ilharco et al., 2022). Regarding the language experiments, we aligned with Yadav et al. (2023); Ilharco et al. (2023) and utilized three variants of the T5 model (Raffel et al., 2020), namely T5-Small, T5-Base, and T5-Large, with training conducted for a maximum of 75,000 steps. We employed an effective training batch size of 1024, with a learning rate of 1e-4. To prevent overfitting, we implemented an early stopping mechanism with a patience threshold of 5. During training, we used bfloat16 and the maximum sequence length was set to 128. |
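
The Task Addition and Task Negation operations quoted above amount to adding or subtracting task vectors (the element-wise difference between a fine-tuned checkpoint and the pre-trained one) to the pre-trained weights. A minimal sketch of the idea, using plain Python dicts in place of real model state dicts; the function names are illustrative, not taken from the released TaLoS code:

```python
def make_task_vector(theta_ft, theta_0):
    """Task vector tau_t = theta_ft - theta_0 (element-wise difference)."""
    return {k: theta_ft[k] - theta_0[k] for k in theta_0}

def apply_task_vectors(theta_0, task_vectors, alpha=1.0):
    """Model editing: theta_0 + alpha * sum of task vectors.

    Task Addition uses alpha > 0 over several task vectors;
    Task Negation is the single-vector case with alpha < 0.
    """
    edited = dict(theta_0)
    for tau in task_vectors:
        for k in edited:
            edited[k] = edited[k] + alpha * tau[k]
    return edited
```

In the paper's sparse setting, each task vector would additionally be masked so that only a calibrated subset of entries is non-zero; the editing step itself is unchanged.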
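
The normalized accuracy of Eq. (11) is straightforward to compute once the fused-model and single-task accuracies per task are available. A sketch, with hypothetical argument names (the report only defines the formula, not an API):

```python
def normalized_accuracy(fused_acc, single_acc):
    """Eq. (11): average over T tasks of the ratio between the accuracy of
    the multi-task fused model on task t and the accuracy of the individual
    fine-tuned checkpoint on task t before fusion."""
    assert len(fused_acc) == len(single_acc) and len(fused_acc) > 0
    ratios = [m / s for m, s in zip(fused_acc, single_acc)]
    return sum(ratios) / len(ratios)
```

A value of 1.0 means the fused model matches each single-task checkpoint on its own task; values below 1.0 quantify the interference introduced by merging.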
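
The timing protocol in the Software Dependencies row (per-iteration wall-clock via `time.perf_counter`, then mean and standard deviation over iterations) can be sketched as follows; the helper name and the dummy step are illustrative, not the authors' measurement code:

```python
import time
import statistics

def timed_iterations(step_fn, n_iters):
    """Run step_fn n_iters times, timing each call with time.perf_counter,
    and return (mean, stdev) of the per-iteration wall-clock times."""
    timings = []
    for _ in range(n_iters):
        t0 = time.perf_counter()
        step_fn()  # stands in for one fine-tuning iteration
        timings.append(time.perf_counter() - t0)
    return statistics.mean(timings), statistics.stdev(timings)
```

Peak GPU memory, as described in the report, would be sampled separately (e.g. once per second via NVML) and reduced with a running maximum rather than averaged.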