Zeroth-Order Fine-Tuning of LLMs with Transferable Static Sparsity
Authors: Wentao Guo, Jikai Long, Yimeng Zeng, Zirui Liu, Xinyu Yang, Yide Ran, Jacob Gardner, Osbert Bastani, Christopher De Sa, Xiaodong Yu, Beidi Chen, Zhaozhuo Xu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments across various LLMs and demonstrate that our method achieves competitive performance across various downstream tasks. 4 EXPERIMENTS: In this section, we aim to validate the effectiveness of our SensZOQ, shown in Figure 4, as a memory-efficient LLM fine-tuning solution. This naturally leads to comparison with other ZO methods, which we evaluate in Section 4.1. |
| Researcher Affiliation | Academia | 1 Princeton University, 3 University of Pennsylvania, 2 Stevens Institute of Technology, 4 University of Minnesota, 5 Carnegie Mellon University, 6 Cornell University |
| Pseudocode | Yes | Listing 1: Example PyTorch-like code snippet that implements the forward calls with FP16 sparse and FP16 dense parameters. Listing 2: Example PyTorch-like code snippet that implements the forward calls with 16-bit sparse and quantized dense parameters. |
| Open Source Code | Yes | We provide an open-source implementation at https://github.com/GarlGuo/SensZOQ. |
| Open Datasets | Yes | We use SST-2 (Socher et al., 2013), RTE (Wang et al., 2018), CB (De Marneffe et al., 2019), BoolQ (Clark et al., 2019), WSC (Levesque et al., 2012), WiC (Pilehvar & Camacho-Collados, 2019), COPA (Roemmele et al., 2011), and WinoGrande (WinoG) (Sakaguchi et al., 2020) datasets. C4 (Raffel et al., 2019) is also mentioned as a pre-training dataset. ArXiv (Cohan et al., 2018), a pile of scientific papers; we use the ArXiv articles subset from this dataset (https://huggingface.co/datasets/armanc/scientific_papers). OpenWebMath (Paster et al., 2024), a pile of Internet mathematical proofs (https://huggingface.co/datasets/open-web-math/open-web-math). Wiki103 (Merity et al., 2016), a pile of selected Wikipedia articles (https://huggingface.co/datasets/Salesforce/wikitext). |
| Dataset Splits | Yes | Usually, the training/validation set will be sampled from the original training dataset with sizes 1000/500 respectively, and the evaluation set is of size min(1000, |original validation or test set|). However, for CB and COPA, we use 100 for the validation set size. We use 2051/200 for Arc-E, 919/200 for Arc-C, 39705/200 for HellaSwag, 15913/200 for PIQA, 4757/200 for OBQA, 33210/200 for SIQA, 20000/200 for MMLU (training is on an auxiliary training set), and 97267/200 for AQuA. |
| Hardware Specification | Yes | Figure 15 (subfigures 1 and 3) is trained and evaluated on a single GPU node with 1 NVIDIA RTX A6000 GPU and 1 Intel Xeon Gold 6342 CPU, with PyTorch version 2.2, Hugging Face Transformers version 4.36, and CUDA 12.2. In subfigures 2 and 4 in Figure 15, we use an NVIDIA A100-SXM4 (40 GB) and an AMD EPYC 7543P 32-Core CPU with PyTorch version 2.1, Hugging Face Transformers version 4.38.2, and CUDA 12.2. |
| Software Dependencies | Yes | Figure 15 (subfigures 1 and 3) is trained and evaluated on a single GPU node with 1 NVIDIA RTX A6000 GPU and 1 Intel Xeon Gold 6342 CPU, with PyTorch version 2.2, Hugging Face Transformers version 4.36, and CUDA 12.2. In subfigures 2 and 4 in Figure 15, we use an NVIDIA A100-SXM4 (40 GB) and an AMD EPYC 7543P 32-Core CPU with PyTorch version 2.1, Hugging Face Transformers version 4.38.2, and CUDA 12.2. |
| Experiment Setup | Yes | For all ZO experiments, we use 20,000 training steps with the ZO-SGD optimizer (Definition 2). We evaluate on the validation or test set at the end of the training. Usually, the training/validation set will be sampled from the original training dataset with sizes 1000/500 respectively, and the evaluation set is of size min(1000, |original validation or test set|). However, for CB and COPA, we use 100 for the validation set size. We use 2051/200 for Arc-E, 919/200 for Arc-C, 39705/200 for HellaSwag, 15913/200 for PIQA, 4757/200 for OBQA, 33210/200 for SIQA, 20000/200 for MMLU (training is on an auxiliary training set), and 97267/200 for AQuA. For all ZO experiments in Tables 7, 8, and 9, we use a batch size of 16, except for the Mistral-7B on MMLU experiment in Table 8, where we use a batch size of 8 for all methods. |
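The ZO-SGD optimizer referenced in the setup row estimates gradients from forward passes alone, and the paper's static sparsity restricts perturbations and updates to a fixed subset of parameters. A minimal NumPy sketch of this pattern (the function name, mask handling, and hyperparameters here are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def zo_sgd_step(w, loss_fn, mask, lr=1e-2, eps=1e-3, rng=None):
    """One zeroth-order SGD step with a static sparse perturbation mask.

    Uses a two-point (SPSA-style) estimate: the loss is probed at
    w + eps*z and w - eps*z, where z is Gaussian noise restricted to the
    coordinates selected by the binary mask, so only those coordinates
    are ever perturbed or updated.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    z = rng.standard_normal(w.shape) * mask       # masked random direction
    proj_grad = (loss_fn(w + eps * z) - loss_fn(w - eps * z)) / (2 * eps)
    return w - lr * proj_grad * z                 # step along the masked direction
```

Note that only two forward evaluations are needed per step, which is what makes ZO fine-tuning memory-efficient: no activations are stored for backpropagation.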
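The pseudocode row quotes listings that implement forward calls combining a frozen dense weight with a small set of fine-tuned parameters stored sparsely. A hedged NumPy sketch of that scatter-then-matmul pattern (names and the flat-index storage format are assumptions for illustration):

```python
import numpy as np

def sparse_dense_forward(x, w_dense, sparse_idx, sparse_vals):
    """Forward pass mixing a frozen dense weight matrix with trainable
    entries kept sparsely as (flat index, value) pairs.

    The fine-tuned values are scattered into a copy of the dense weight
    before the matmul; only `sparse_vals` would receive ZO updates.
    """
    w = w_dense.copy()
    w.ravel()[sparse_idx] = sparse_vals   # overwrite the fine-tuned entries
    return x @ w
```

In the paper's listings the dense weights are FP16 or quantized while the sparse entries stay in 16-bit, so a real implementation would also dequantize before (or fuse dequantization into) the matmul.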