Fine-Tuning Attention Modules Only: Enhancing Weight Disentanglement in Task Arithmetic
Authors: Ruochen Jin, Bojian Hou, Jiancong Xiao, Weijie Su, Li Shen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The results in Figure 3 indicate that the attention module demonstrates kernel behavior. The triangle dots show the comparison of the kernel behavior test between the attention modules (yellow) and the whole models (blue). ... In particular, Table 2 shows that our method significantly outperforms its non-linear counterparts (Ilharco et al., 2023) and achieves state-of-the-art results on the task addition benchmarks. All our experiments are performed using the same hardware consisting of four NVIDIA 3090 GPUs with 24 GB of memory each, and can be reproduced in less than 150 GPU hours. |
| Researcher Affiliation | Academia | 1 East China Normal University, Shanghai, China; 2 University of Pennsylvania, Philadelphia, PA, USA. EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/kyrie-23/task_arithmetic_tangent. |
| Open Datasets | Yes | SVHN (Netzer et al., 2011): The Street View House Numbers dataset... MNIST (LeCun, 1998): A database of handwritten digits... EuroSAT (Helber et al., 2019): A dataset based on Sentinel-2 satellite images... RESISC45 (Cheng et al., 2017): The remote sensing image scene classification dataset... Cars (Krause et al., 2013): This dataset contains images of cars... DTD (Describable Textures Dataset) (Cimpoi et al., 2014): This dataset is designed for texture recognition... SUN397 (Xiao et al., 2016): The Scene UNderstanding (SUN) dataset... GTSRB (German Traffic Sign Recognition Benchmark) (Stallkamp et al., 2011): This dataset comprises images of German traffic signs... For NLP tasks, we utilize Flan-T5 (Chung et al., 2022) as our pre-trained language model. For fine-tuning, we employ the Flan-T5-base models on seven tasks derived from the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018), using the same random seed 42 to initialize the models. These tasks are CoLA, MNLI, MRPC, QQP, RTE, SST2, and STSB. |
| Dataset Splits | Yes | We fine-tune several CLIP pre-trained ViTs (Dosovitskiy et al., 2021) of different sizes following the same setup as Ilharco et al. (2023) on 8 tasks... All the fine-tuning experiments follow the same training protocol specified in Ilharco et al. (2022), with minor modifications to the training code to use linearized models when needed... In the task addition benchmarks, after fine-tuning, we evaluate different scaling coefficients α ∈ {0.0, 0.05, 0.1, ..., 1.0} and choose the value that achieves the highest target metric on a small held-out proportion of the training set, as specified in Ilharco et al. (2022). |
| Hardware Specification | Yes | All our experiments are performed using the same hardware consisting of four NVIDIA 3090 GPUs with 24 GB of memory each, and can be reproduced in less than 150 GPU hours. |
| Software Dependencies | No | The paper mentions using optimizers such as AdamW and pre-trained models from the open_clip repository, but does not specify version numbers for key software components or libraries such as PyTorch, TensorFlow, or Python. |
| Experiment Setup | Yes | We fine-tune for 2,000 iterations with a batch size of 128, a learning rate of 10^-5, a cosine annealing learning rate schedule with 200 warm-up steps, and the AdamW optimizer (Loshchilov & Hutter, 2019). |
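
The scaling-coefficient search quoted under "Dataset Splits" can be sketched as a grid search over α: merge the pre-trained weights with the summed task vectors at each candidate α and keep the value that scores best on a held-out split. This is a minimal reconstruction under our own naming (`merge_with_alpha`, `search_alpha`, and the `evaluate` callback are placeholders, not the authors' code):

```python
def merge_with_alpha(pretrained, task_vectors, alpha):
    """Task arithmetic merge: theta = theta_pre + alpha * sum of task vectors."""
    merged = {}
    for name, weight in pretrained.items():
        delta = sum(tv[name] for tv in task_vectors)
        merged[name] = weight + alpha * delta
    return merged


def search_alpha(pretrained, task_vectors, evaluate):
    """Grid-search alpha over {0.0, 0.05, ..., 1.0}, as described in the paper,
    keeping the value with the highest held-out metric."""
    best_alpha, best_score = 0.0, float("-inf")
    for step in range(21):  # 21 points: 0.0, 0.05, ..., 1.0
        alpha = round(step * 0.05, 2)
        score = evaluate(merge_with_alpha(pretrained, task_vectors, alpha))
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha, best_score
```

In practice `pretrained` and each task vector would be state dicts of tensors and `evaluate` would run the merged model on the held-out proportion of the training set; scalar weights are used here only to keep the sketch self-contained.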
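
The schedule in the "Experiment Setup" row (base learning rate 10^-5, 200 linear warm-up steps, cosine annealing over 2,000 iterations) can be written as a per-step learning-rate function. The constants come from the paper; the warm-up shape and the function itself are our reconstruction, not the authors' code:

```python
import math

BASE_LR = 1e-5       # learning rate from the paper
WARMUP_STEPS = 200   # warm-up steps from the paper
TOTAL_STEPS = 2000   # fine-tuning iterations from the paper


def lr_at(step):
    """Learning rate at a given iteration: linear warm-up, then cosine decay."""
    if step < WARMUP_STEPS:
        # Ramp linearly from BASE_LR / WARMUP_STEPS up to BASE_LR.
        return BASE_LR * (step + 1) / WARMUP_STEPS
    # Cosine annealing from BASE_LR down toward 0 over the remaining steps.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * BASE_LR * (1.0 + math.cos(math.pi * progress))
```

With PyTorch this would typically be passed to `torch.optim.lr_scheduler.LambdaLR` around an `AdamW` optimizer, matching the optimizer named in the paper.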