Enhancing Large Language Model Performance with Gradient-Based Parameter Selection
Authors: Haoling Li, Xin Zhang, Xiao Liu, Yeyun Gong, Yifan Wang, Qi Chen, Peng Cheng
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical results in various training paradigms like SFT and DPO for various domains of tasks demonstrate that GMT not only preserves the original network structure but also enhances the potential performance of LLMs. We conduct both theoretical analysis and exhaustive experiments with different base models on several benchmarks comparing with representative baselines. |
| Researcher Affiliation | Collaboration | ¹Tsinghua University, ²Microsoft Research |
| Pseudocode | Yes | Algorithm 1: Gradient-Mask Tuning |
| Open Source Code | No | The paper does not explicitly state that the authors' code is open-source or provide a link to a code repository for the methodology described. |
| Open Datasets | Yes | For code generation, we employ the Magicoder-Evol-Instruct-110K (Wei et al. 2023) as the training data... The trained models are evaluated using the HumanEval (Chen et al. 2021) and MBPP (Austin et al. 2021) benchmarks... For math reasoning, the MetaMathQA (Yu et al. 2023a) dataset is employed to fine-tune the MISTRAL-7B and LLAMA3-8B models. The evaluation is conducted using the GSM8k (Cobbe et al. 2021) and MATH (Hendrycks et al. 2021) benchmarks... For the general domain, the TULU V2 (Wang et al. 2024a) dataset is utilized in SFT-phase training on the LLAMA2-7B (Touvron et al. 2023) and LLAMA2-13B models, and the UltraFeedback (Cui et al. 2023) dataset is utilized in DPO-phase training. Following HFT (Hui et al. 2024), we evaluate models on MMLU (Hendrycks et al. 2020), GSM8k (Cobbe et al. 2021), BBH (Suzgun et al. 2023), TyDi QA (Clark et al. 2020), TruthfulQA (Lin, Hilton, and Evans 2022), and HumanEval (Chen et al. 2021). |
| Dataset Splits | Yes | For code generation, we employ the Magicoder-Evol-Instruct-110K (Wei et al. 2023) as the training data... The trained models are evaluated using the HumanEval (Chen et al. 2021) and MBPP (Austin et al. 2021) benchmarks... For math reasoning, the MetaMathQA (Yu et al. 2023a) dataset is employed to fine-tune the MISTRAL-7B and LLAMA3-8B models. The evaluation is conducted using the GSM8k (Cobbe et al. 2021) and MATH (Hendrycks et al. 2021) benchmarks... These are standard and well-defined benchmarks with inherent training and evaluation splits. |
| Hardware Specification | Yes | All training experiments were done on NVIDIA A100 and NVIDIA H100 machines. |
| Software Dependencies | No | The paper mentions using BFloat16 precision and a cosine learning rate scheduler, but does not provide specific version numbers for any key software components like Python, PyTorch, or TensorFlow. |
| Experiment Setup | Yes | In addition, we utilize BFloat16 precision and set the weight decay to 0. We use the cosine learning rate scheduler after a linear warm-up stage with a ratio of 0.03. |
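The pseudocode cited above (Algorithm 1: Gradient-Mask Tuning) describes selecting parameters by gradient magnitude and updating only those, which is the paper's core mechanism. As a minimal illustrative sketch (not the authors' implementation — the function name, the per-step top-k selection, and the plain SGD update are all assumptions for illustration), GMT-style selection can be shown on flat gradient vectors: zero out the smallest-magnitude gradient entries according to a mask ratio, then apply the update only through the surviving gradients.

```python
def gradient_mask_update(params, grads, lr, mask_ratio):
    """Illustrative gradient-mask update on flat lists of floats.

    mask_ratio is the fraction of gradient entries to zero out,
    chosen as the smallest entries by absolute magnitude. Only the
    remaining (large-gradient) parameters receive an SGD step.
    """
    k = int(len(grads) * mask_ratio)            # number of entries to mask
    by_magnitude = sorted(abs(g) for g in grads)
    # Threshold at the k-th smallest magnitude; ties at the threshold
    # are also masked in this simple sketch.
    threshold = by_magnitude[k - 1] if k > 0 else float("-inf")
    masked = [0.0 if abs(g) <= threshold else g for g in grads]
    new_params = [p - lr * g for p, g in zip(params, masked)]
    return new_params, masked


params = [1.0, 2.0, 3.0, 4.0]
grads = [0.1, -0.5, 0.02, 0.3]
new_params, masked = gradient_mask_update(params, grads, lr=0.1, mask_ratio=0.5)
# The two smallest-magnitude gradients (0.1 and 0.02) are zeroed;
# only the parameters with gradients -0.5 and 0.3 are updated.
```

In a real fine-tuning run the same idea would be applied to per-tensor gradients inside the optimizer step (e.g. via a boolean mask on each gradient tensor before the update), which preserves the network structure because no weights are removed — masked parameters simply keep their current values for that step.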