Leveraging Submodule Linearity Enhances Task Arithmetic Performance in LLMs
Authors: Rui Dai, Sile Hu, Xu Shen, Yonggang Zhang, Xinmei Tian, Jieping Ye
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To validate the effectiveness of our proposed method, we conducted merging experiments across models fine-tuned from Llama-2-7B and Llama-2-13B (Touvron et al., 2023) for three distinct tasks: Math, Coding, and Translate. The experimental results demonstrate that our method significantly outperforms various baseline techniques, including standard arithmetic approaches, across different model scales and diverse tasks, particularly with respect to the decomposition level of layers and attn/MLPs. |
| Researcher Affiliation | Academia | Rui Dai¹, Sile Hu², Xu Shen², Yonggang Zhang³, Xinmei Tian¹, Jieping Ye². ¹National Engineering Laboratory for Brain-Inspired Intelligence Technology and Application, University of Science and Technology of China; ²Independent Researcher; ³Hong Kong Baptist University |
| Pseudocode | Yes | Algorithm 1 (Merging Modules with Linearity at Layer Level). Input: T fine-tuned models with L layers, θ_t = {θ_t^1, θ_t^2, ..., θ_t^L} for t = 1, ..., T; pre-trained model θ_0 = {θ_0^1, θ_0^2, ..., θ_0^L}; task-related datasets D_1, D_2, ..., D_T. Output: merged model θ_merge = {θ_merge^1, θ_merge^2, ..., θ_merge^L}. For t = 1 to T: sample d_t ⊂ D_t randomly with \|d_t\| = N (a small set of task-related data); set H_t^1 ← d_t (input of the first layer); for i = 1 to L−1: H_t^{i+1} ← {f(x; θ_0^i) : x ∈ H_t^i} (prepare input feature sets for Eq. 12). For i = 1 to L: compute optimal merging weights α_1^i, α_2^i, ..., α_T^i via Eq. 12; θ_merge^i ← θ_0^i + Σ_{t=1}^T α_t^i (θ_t^i − θ_0^i) (merge modules from different models linearly). |
| Open Source Code | Yes | Corresponding authors. Code: https://github.com/deep-analysis-research/SLTA. |
| Open Datasets | Yes | For the mathematics task, we employ the GSM8K (Cobbe et al., 2021) training set. For the coding task, we utilize the Code Alpaca (Chaudhary, 2023) dataset. Lastly, for the translation task, we apply the zh→en dataset from Xu et al. (2024a). ... For evaluation, we test mathematical capability using the GSM8K (Cobbe et al., 2021) test set, coding capability with HumanEval (Chen et al., 2021), and translation capability with the tools and datasets from Xu et al. (2024a). ... Results for the Qwen-2-0.5B (Yang et al., 2024a) model, along with the datasets GSM8K (Cobbe et al., 2021), Cosmos QA (Huang et al., 2019), Ropes (Lin et al., 2019), and Winogrande (Sakaguchi et al., 2020), are presented in Table 6. We also report results on the Qwen-2-7B (Qwen Team, 2024) model. |
| Dataset Splits | No | The paper states: 'For evaluation, we test mathematical capability using the GSM8K (Cobbe et al., 2021) test set, coding capability with the Human Eval (Chen et al., 2021), and translation capability with the tools and datasets from (Xu et al., 2024a).' and 'we use a limited dataset of only 30 samples per task for calculating the merging weights.' While it mentions using test sets and a limited number of samples for weight calculation, it does not specify the exact percentages or counts for training, validation, and testing splits for the main model fine-tuning experiments, nor does it refer to predefined standard splits for all datasets used. |
| Hardware Specification | No | The paper mentions using Llama-2-7B and Llama-2-13B as backbone models but does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments or training these models. |
| Software Dependencies | No | The paper mentions 'Fast Chat (Zheng et al., 2023) template' and 'np.linalg.solve' (implying NumPy), but it does not specify version numbers for these or any other software libraries or tools used in the experiments, which is required for reproducibility. |
| Experiment Setup | Yes | During training, we adopt the prompt of the Fast Chat (Zheng et al., 2023) template and fine-tune the models for 3 epochs with a batch size of 128 and a learning rate of 2×10⁻⁵. ... we use a limited dataset of only 30 samples per task for calculating the merging weights. |
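The merging step in Algorithm 1 can be sketched in NumPy. This is a minimal illustration of the final update θ_merge^i = θ_0^i + Σ_t α_t^i (θ_t^i − θ_0^i), not the paper's released implementation: the function name `merge_layer`, the toy shapes, and the example weights are all hypothetical, and the per-layer weights α_t^i are assumed to have been obtained already (the paper computes them via its Eq. 12; the solve itself is omitted here).

```python
import numpy as np

def merge_layer(theta_0, task_thetas, alphas):
    """Linearly merge one layer's parameters from T fine-tuned models.

    theta_0     : pre-trained parameters of this layer (ndarray)
    task_thetas : list of T fine-tuned parameter arrays for this layer
    alphas      : list of T per-layer merging weights (alpha_t^i)

    Implements theta_merge = theta_0 + sum_t alpha_t * (theta_t - theta_0).
    """
    merged = theta_0.astype(float).copy()
    for alpha, theta_t in zip(alphas, task_thetas):
        merged += alpha * (theta_t - theta_0)  # add weighted task vector
    return merged

# Toy example with two tasks and a hypothetical 2x2 "layer".
theta_0 = np.zeros((2, 2))
task_thetas = [np.ones((2, 2)), 2.0 * np.ones((2, 2))]
alphas = [0.5, 0.25]
merged = merge_layer(theta_0, task_thetas, alphas)
print(merged)  # 0.5*1 + 0.25*2 = 1.0 in every entry
```

In the paper's setting this loop runs once per layer (or per attn/MLP submodule at the finer decomposition level), with a separate weight vector per layer rather than a single global scaling as in standard task arithmetic.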