Leveraging Submodule Linearity Enhances Task Arithmetic Performance in LLMs
Authors: Rui Dai, Sile Hu, Xu Shen, Yonggang Zhang, Xinmei Tian, Jieping Ye
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To validate the effectiveness of our proposed method, we conducted merging experiments across models fine-tuned from Llama-2-7B and Llama-2-13B (Touvron et al., 2023) for three distinct tasks: Math, Coding, and Translate. The experimental results demonstrate that our method significantly outperforms various baseline techniques, including standard arithmetic approaches, across different model scales and diverse tasks, particularly with respect to the decomposition level of layers and attn/MLPs. |
| Researcher Affiliation | Academia | Rui Dai¹, Sile Hu², Xu Shen², Yonggang Zhang³, Xinmei Tian¹, Jieping Ye². ¹National Engineering Laboratory for Brain-Inspired Intelligence Technology and Application, University of Science and Technology of China; ²Independent Researcher; ³Hong Kong Baptist University |
| Pseudocode | Yes | Algorithm 1 (Merging Modules with Linearity at Layer Level). Input: T fine-tuned models with L layers, θ_t = {θ_t^1, θ_t^2, ..., θ_t^L} for t = 1, ..., T; pre-trained model θ_0 = {θ_0^1, θ_0^2, ..., θ_0^L}; task-related datasets D_1, D_2, ..., D_T. Output: merged model θ_merge = {θ_merge^1, θ_merge^2, ..., θ_merge^L}. For t = 1 to T: sample d_t ⊂ D_t randomly with \|d_t\| = N (a small set of task-related data); set H_t^1 ← d_t (input of the first layer); for i = 1 to L−1: H_t^{i+1} ← {f(x; θ_0^i) : x ∈ H_t^i} (prepare input feature sets for Eq. 12). For i = 1 to L: compute optimal merging weights α_1^i, α_2^i, ..., α_T^i via Eq. 12; θ_merge^i ← θ_0^i + Σ_{t=1}^T α_t^i (θ_t^i − θ_0^i) (merge modules from different models linearly). |
| Open Source Code | Yes | Corresponding authors. Code: https://github.com/deep-analysis-research/SLTA. |
| Open Datasets | Yes | For the mathematics task, we employ the GSM8K (Cobbe et al., 2021) training set. For the coding task, we utilize the Code Alpaca (Chaudhary, 2023) dataset. Lastly, for the translation task, we apply the zh→en dataset from Xu et al. (2024a). ... For evaluation, we test mathematical capability using the GSM8K (Cobbe et al., 2021) test set, coding capability with HumanEval (Chen et al., 2021), and translation capability with the tools and datasets from Xu et al. (2024a). ... Results for the Qwen-2-0.5B (Yang et al., 2024a) model, along with the datasets GSM8K (Cobbe et al., 2021), Cosmos QA (Huang et al., 2019), Ropes (Lin et al., 2019), and Winogrande (Sakaguchi et al., 2020), are presented in Table 6. We also report results on the Qwen-2-7B (Qwen Team, 2024) model. |
| Dataset Splits | No | The paper states: 'For evaluation, we test mathematical capability using the GSM8K (Cobbe et al., 2021) test set, coding capability with the Human Eval (Chen et al., 2021), and translation capability with the tools and datasets from (Xu et al., 2024a).' and 'we use a limited dataset of only 30 samples per task for calculating the merging weights.' While it mentions using test sets and a limited number of samples for weight calculation, it does not specify the exact percentages or counts for training, validation, and testing splits for the main model fine-tuning experiments, nor does it refer to predefined standard splits for all datasets used. |
| Hardware Specification | No | The paper mentions using Llama-2-7B and Llama-2-13B as backbone models but does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments or training these models. |
| Software Dependencies | No | The paper mentions 'Fast Chat (Zheng et al., 2023) template' and 'np.linalg.solve' (implying NumPy), but it does not specify version numbers for these or any other software libraries or tools used in the experiments, which is required for reproducibility. |
| Experiment Setup | Yes | During training, we adopt the prompt of the Fast Chat (Zheng et al., 2023) template and fine-tune the models for 3 epochs with a batch size of 128 and a learning rate of 2×10⁻⁵. ... we use a limited dataset of only 30 samples per task for calculating the merging weights. |
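The merging step in Algorithm 1 can be sketched in NumPy. This is a minimal illustration of the final update θ_merge^i = θ_0^i + Σ_t α_t^i (θ_t^i − θ_0^i), not the paper's released implementation: the function name `merge_layer`, the toy shapes, and the example weights are all hypothetical, and the per-layer weights α_t^i are assumed to have been obtained already (the paper computes them via its Eq. 12; the solve itself is omitted here).

```python
import numpy as np

def merge_layer(theta_0, task_thetas, alphas):
    """Linearly merge one layer's parameters from T fine-tuned models.

    theta_0     : pre-trained parameters of this layer (ndarray)
    task_thetas : list of T fine-tuned parameter arrays for this layer
    alphas      : list of T per-layer merging weights (alpha_t^i)

    Implements theta_merge = theta_0 + sum_t alpha_t * (theta_t - theta_0).
    """
    merged = theta_0.astype(float).copy()
    for alpha, theta_t in zip(alphas, task_thetas):
        merged += alpha * (theta_t - theta_0)  # add weighted task vector
    return merged

# Toy example with two tasks and a hypothetical 2x2 "layer".
theta_0 = np.zeros((2, 2))
task_thetas = [np.ones((2, 2)), 2.0 * np.ones((2, 2))]
alphas = [0.5, 0.25]
merged = merge_layer(theta_0, task_thetas, alphas)
print(merged)  # 0.5*1 + 0.25*2 = 1.0 in every entry
```

In the paper's setting this loop runs once per layer (or per attn/MLP submodule at the finer decomposition level), with a separate weight vector per layer rather than a single global scaling as in standard task arithmetic.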