Enhancing Large Language Model Performance with Gradient-Based Parameter Selection
Authors: Haoling Li, Xin Zhang, Xiao Liu, Yeyun Gong, Yifan Wang, Qi Chen, Peng Cheng
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical results in various training paradigms like SFT and DPO for various domains of tasks demonstrate that GMT not only preserves the original network structure but also enhances the potential performance of LLMs. We conduct both theoretical analysis and exhaustive experiments with different base models on several benchmarks comparing with representative baselines. |
| Researcher Affiliation | Collaboration | ¹Tsinghua University, ²Microsoft Research |
| Pseudocode | Yes | Algorithm 1: Gradient-Mask Tuning |
| Open Source Code | No | The paper does not explicitly state that the authors' code is open-source or provide a link to a code repository for the methodology described. |
| Open Datasets | Yes | For code generation, we employ the Magicoder-Evol-Instruct-110K (Wei et al. 2023) as the training data... The trained models are evaluated using the HumanEval (Chen et al. 2021) and MBPP (Austin et al. 2021) benchmarks... For math reasoning, the MetaMathQA (Yu et al. 2023a) dataset is employed to fine-tune the MISTRAL-7B and LLAMA3-8B models. The evaluation is conducted using the GSM8k (Cobbe et al. 2021) and MATH (Hendrycks et al. 2021) benchmarks... For the general domain, the TULU V2 (Wang et al. 2024a) dataset is utilized in SFT-phase training on the LLAMA2-7B (Touvron et al. 2023) and LLAMA2-13B models, and the UltraFeedback (Cui et al. 2023) dataset is utilized in DPO-phase training. Following HFT (Hui et al. 2024), we evaluate models on MMLU (Hendrycks et al. 2020), GSM8k (Cobbe et al. 2021), BBH (Suzgun et al. 2023), TyDi QA (Clark et al. 2020), TruthfulQA (Lin, Hilton, and Evans 2022), and HumanEval (Chen et al. 2021). |
| Dataset Splits | Yes | For code generation, we employ the Magicoder-Evol-Instruct-110K (Wei et al. 2023) as the training data... The trained models are evaluated using the HumanEval (Chen et al. 2021) and MBPP (Austin et al. 2021) benchmarks... For math reasoning, the MetaMathQA (Yu et al. 2023a) dataset is employed to fine-tune the MISTRAL-7B and LLAMA3-8B models. The evaluation is conducted using the GSM8k (Cobbe et al. 2021) and MATH (Hendrycks et al. 2021) benchmarks... These are standard and well-defined benchmarks with inherent training and evaluation splits. |
| Hardware Specification | Yes | All training experiments were done on NVIDIA A100 and NVIDIA H100 machines. |
| Software Dependencies | No | The paper mentions using BFloat16 precision and a cosine learning rate scheduler, but does not provide specific version numbers for any key software components like Python, PyTorch, or TensorFlow. |
| Experiment Setup | Yes | In addition, we utilize BFloat16 precision and set the weight decay to 0. We use the cosine learning rate scheduler after a linear warm-up stage with a ratio of 0.03. |
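The pseudocode cited above (Algorithm 1: Gradient-Mask Tuning) describes selecting parameters by gradient magnitude and updating only those, which is the paper's core mechanism. As a minimal illustrative sketch (not the authors' implementation — the function name, the per-step top-k selection, and the plain SGD update are all assumptions for illustration), GMT-style selection can be shown on flat gradient vectors: zero out the smallest-magnitude gradient entries according to a mask ratio, then apply the update only through the surviving gradients.

```python
def gradient_mask_update(params, grads, lr, mask_ratio):
    """Illustrative gradient-mask update on flat lists of floats.

    mask_ratio is the fraction of gradient entries to zero out,
    chosen as the smallest entries by absolute magnitude. Only the
    remaining (large-gradient) parameters receive an SGD step.
    """
    k = int(len(grads) * mask_ratio)            # number of entries to mask
    by_magnitude = sorted(abs(g) for g in grads)
    # Threshold at the k-th smallest magnitude; ties at the threshold
    # are also masked in this simple sketch.
    threshold = by_magnitude[k - 1] if k > 0 else float("-inf")
    masked = [0.0 if abs(g) <= threshold else g for g in grads]
    new_params = [p - lr * g for p, g in zip(params, masked)]
    return new_params, masked


params = [1.0, 2.0, 3.0, 4.0]
grads = [0.1, -0.5, 0.02, 0.3]
new_params, masked = gradient_mask_update(params, grads, lr=0.1, mask_ratio=0.5)
# The two smallest-magnitude gradients (0.1 and 0.02) are zeroed;
# only the parameters with gradients -0.5 and 0.3 are updated.
```

In a real fine-tuning run the same idea would be applied to per-tensor gradients inside the optimizer step (e.g. via a boolean mask on each gradient tensor before the update), which preserves the network structure because no weights are removed — masked parameters simply keep their current values for that step.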