Fine-Tuning Attention Modules Only: Enhancing Weight Disentanglement in Task Arithmetic
Authors: Ruochen Jin, Bojian Hou, Jiancong Xiao, Weijie Su, Li Shen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The results in Figure 3 indicate that the attention module demonstrates kernel behavior. The triangle dots show the comparison of the kernel behavior test between the attention modules (yellow) and the whole models (blue). ... In particular, Table 2 shows that our method significantly outperforms its non-linear counterparts (Ilharco et al., 2023) and achieves state-of-the-art results on the task addition benchmarks. All our experiments are performed using the same hardware consisting of four NVIDIA 3090 GPUs with 24 GB of memory each, and can be reproduced in less than 150 GPU hours. |
| Researcher Affiliation | Academia | 1 East China Normal University, Shanghai, China; 2 University of Pennsylvania, Philadelphia, PA, USA. EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/kyrie-23/task_arithmetic_tangent. |
| Open Datasets | Yes | SVHN (Netzer et al., 2011): The Street View House Numbers dataset... MNIST (LeCun, 1998): A database of handwritten digits... EuroSAT (Helber et al., 2019): A dataset based on Sentinel-2 satellite images... RESISC45 (Cheng et al., 2017): The remote sensing image scene classification dataset... Cars (Krause et al., 2013): This dataset contains images of cars... DTD (Describable Textures Dataset) (Cimpoi et al., 2014): This dataset is designed for texture recognition... SUN397 (Xiao et al., 2016): The Scene UNderstanding (SUN) dataset... GTSRB (German Traffic Sign Recognition Benchmark) (Stallkamp et al., 2011): This dataset comprises images of German traffic signs... For NLP tasks, we utilize Flan-T5 (Chung et al., 2022) as our pre-trained language model. For fine-tuning, we employ the Flan-T5-base models on seven tasks derived from the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018), using the same random seed 42 to initialize the models. These tasks are CoLA, MNLI, MRPC, QQP, RTE, SST2, and STSB. |
| Dataset Splits | Yes | We fine-tune several CLIP pre-trained ViTs (Dosovitskiy et al., 2021) of different sizes following the same setup as Ilharco et al. (2023) on 8 tasks... All the fine-tuning experiments follow the same training protocol specified in Ilharco et al. (2022), with minor modifications to the training code to use linearized models when needed... In the task addition benchmarks, after fine-tuning, we evaluate different scaling coefficients α ∈ {0.0, 0.05, 0.1, ..., 1.0} and choose the value that achieves the highest target metric on a small held-out proportion of the training set, as specified in Ilharco et al. (2022). |
| Hardware Specification | Yes | All our experiments are performed using the same hardware consisting of four NVIDIA 3090 GPUs with 24 GB of memory each, and can be reproduced in less than 150 GPU hours. |
| Software Dependencies | No | The paper mentions using optimizers such as AdamW and pre-trained models from the open_clip repository, but does not specify version numbers for key software components or libraries such as PyTorch, TensorFlow, or Python. |
| Experiment Setup | Yes | We fine-tune for 2,000 iterations with a batch size of 128, a learning rate of 10^-5, a cosine annealing learning rate schedule with 200 warm-up steps, and the AdamW optimizer (Loshchilov & Hutter, 2019). |
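
The scaling-coefficient search quoted under "Dataset Splits" can be sketched as a grid search over α: merge the pre-trained weights with the summed task vectors at each candidate α and keep the value that scores best on a held-out split. This is a minimal reconstruction under our own naming (`merge_with_alpha`, `search_alpha`, and the `evaluate` callback are placeholders, not the authors' code):

```python
def merge_with_alpha(pretrained, task_vectors, alpha):
    """Task arithmetic merge: theta = theta_pre + alpha * sum of task vectors."""
    merged = {}
    for name, weight in pretrained.items():
        delta = sum(tv[name] for tv in task_vectors)
        merged[name] = weight + alpha * delta
    return merged


def search_alpha(pretrained, task_vectors, evaluate):
    """Grid-search alpha over {0.0, 0.05, ..., 1.0}, as described in the paper,
    keeping the value with the highest held-out metric."""
    best_alpha, best_score = 0.0, float("-inf")
    for step in range(21):  # 21 points: 0.0, 0.05, ..., 1.0
        alpha = round(step * 0.05, 2)
        score = evaluate(merge_with_alpha(pretrained, task_vectors, alpha))
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha, best_score
```

In practice `pretrained` and each task vector would be state dicts of tensors and `evaluate` would run the merged model on the held-out proportion of the training set; scalar weights are used here only to keep the sketch self-contained.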
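
The schedule in the "Experiment Setup" row (base learning rate 10^-5, 200 linear warm-up steps, cosine annealing over 2,000 iterations) can be written as a per-step learning-rate function. The constants come from the paper; the warm-up shape and the function itself are our reconstruction, not the authors' code:

```python
import math

BASE_LR = 1e-5       # learning rate from the paper
WARMUP_STEPS = 200   # warm-up steps from the paper
TOTAL_STEPS = 2000   # fine-tuning iterations from the paper


def lr_at(step):
    """Learning rate at a given iteration: linear warm-up, then cosine decay."""
    if step < WARMUP_STEPS:
        # Ramp linearly from BASE_LR / WARMUP_STEPS up to BASE_LR.
        return BASE_LR * (step + 1) / WARMUP_STEPS
    # Cosine annealing from BASE_LR down toward 0 over the remaining steps.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * BASE_LR * (1.0 + math.cos(math.pi * progress))
```

With PyTorch this would typically be passed to `torch.optim.lr_scheduler.LambdaLR` around an `AdamW` optimizer, matching the optimizer named in the paper.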