Zeroth-Order Fine-Tuning of LLMs with Transferable Static Sparsity

Authors: Wentao Guo, Jikai Long, Yimeng Zeng, Zirui Liu, Xinyu Yang, Yide Ran, Jacob Gardner, Osbert Bastani, Christopher De Sa, Xiaodong Yu, Beidi Chen, Zhaozhuo Xu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conduct extensive experiments across various LLMs and demonstrate that our method achieves competitive performance across various downstream tasks." "4 EXPERIMENTS: In this section, we aim to validate the effectiveness of our SensZOQ, shown in Figure 4, as a memory-efficient LLM fine-tuning solution. This naturally leads to comparison with other ZO methods, which we evaluate in Section 4.1."
Researcher Affiliation | Academia | "1Princeton University, 3University of Pennsylvania, 2Stevens Institute of Technology, 4University of Minnesota, 5Carnegie Mellon University, 6Cornell University"
Pseudocode | Yes | "Listing 1: Example PyTorch-like code snippet that implements the forward calls with FP16 sparse and FP16 dense parameters." "Listing 2: Example PyTorch-like code snippet that implements the forward calls with 16-bit sparse and quantized dense parameters."
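The listings themselves are not reproduced in this report. As a rough, illustrative sketch of the forward call they describe (NumPy stands in for PyTorch so the snippet is self-contained, and all names here, such as `forward`, `w_dense`, `sparse_idx`, and `sparse_val`, are hypothetical rather than taken from the paper's code):

```python
import numpy as np

def forward(x, w_dense, sparse_idx, sparse_val):
    """Forward pass combining a frozen dense weight matrix with a small
    set of trainable entries at fixed (static) coordinates.

    x          : (batch, in_features) input activations
    w_dense    : (in_features, out_features) frozen dense weights
    sparse_idx : tuple (row_indices, col_indices) of the static sparse mask
    sparse_val : trainable values placed at those coordinates
    """
    w = w_dense.copy()
    w[sparse_idx] = sparse_val  # overwrite the static sparse coordinates
    return x @ w
```

In an actual PyTorch implementation only the sparse values would receive updates while the dense weights stay frozen (and, in the Listing 2 variant, quantized and dequantized on the fly).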
Open Source Code | Yes | "We provide an open-source implementation at https://github.com/GarlGuo/SensZOQ."
Open Datasets | Yes | "We use SST-2 (Socher et al., 2013), RTE (Wang et al., 2018), CB (De Marneffe et al., 2019), BoolQ (Clark et al., 2019), WSC (Levesque et al., 2012), WiC (Pilehvar & Camacho-Collados, 2019), COPA (Roemmele et al., 2011), and WinoGrande (WinoG) (Sakaguchi et al., 2020) datasets. C4 (Raffel et al., 2019) is also mentioned as a pre-training dataset." "ArXiv (Cohan et al., 2018), a pile of scientific papers. We use the ArXiv articles subset from this dataset. https://huggingface.co/datasets/armanc/scientific_papers" "OpenWebMath (Paster et al., 2024), a pile of Internet mathematical proofs. https://huggingface.co/datasets/open-web-math/open-web-math" "Wiki103 (Merity et al., 2016), a pile of selected Wikipedia articles. https://huggingface.co/datasets/Salesforce/wikitext"
Dataset Splits | Yes | "Usually, the training/validation set will be sampled from the original training dataset with sizes 1000/500 respectively, and the evaluation set is of size min(1000, |original validation or test set|). However, for CB and COPA, we use 100 for the validation set size. We use 2051/200 for Arc-E, 919/200 for Arc-C, 39705/200 for HellaSwag, 15913/200 for PIQA, 4757/200 for OBQA, 33210/200 for SIQA, 20000/200 for MMLU (training is on an auxiliary training set), and 97267/200 for AQuA."
Hardware Specification | Yes | "Figure 15 (subfigures 1 and 3) is trained and evaluated on a single GPU node with 1 NVIDIA RTX A6000 GPU and 1 Intel Xeon Gold 6342 CPU, with PyTorch version 2.2, Hugging Face Transformers version 4.36, and CUDA 12.2. In subfigures 2 and 4 in Figure 15, we use an NVIDIA A100-SXM4 (40 GB) and an AMD EPYC 7543P 32-core CPU with PyTorch version 2.1, Hugging Face version 4.38.2, and CUDA 12.2."
Software Dependencies | Yes | "Figure 15 (subfigures 1 and 3) is trained and evaluated on a single GPU node with 1 NVIDIA RTX A6000 GPU and 1 Intel Xeon Gold 6342 CPU, with PyTorch version 2.2, Hugging Face Transformers version 4.36, and CUDA 12.2. In subfigures 2 and 4 in Figure 15, we use an NVIDIA A100-SXM4 (40 GB) and an AMD EPYC 7543P 32-core CPU with PyTorch version 2.1, Hugging Face version 4.38.2, and CUDA 12.2."
Experiment Setup | Yes | "For all ZO experiments, we use 20,000 training steps with the ZO-SGD optimizer (Definition 2). We evaluate on the validation or test set at the end of training. Usually, the training/validation set will be sampled from the original training dataset with sizes 1000/500 respectively, and the evaluation set is of size min(1000, |original validation or test set|). However, for CB and COPA, we use 100 for the validation set size. We use 2051/200 for Arc-E, 919/200 for Arc-C, 39705/200 for HellaSwag, 15913/200 for PIQA, 4757/200 for OBQA, 33210/200 for SIQA, 20000/200 for MMLU (training is on an auxiliary training set), and 97267/200 for AQuA. For all ZO experiments in Tables 7, 8, and 9, we use a batch size of 16, except for the Mistral-7B on MMLU experiment in Table 8, where we use a batch size of 8 for all methods."
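The quote references the ZO-SGD optimizer (Definition 2) without reproducing it. As background, zeroth-order SGD methods of this kind rely on a standard two-point (SPSA-style) gradient estimate; the sketch below is an illustrative NumPy version with hypothetical names (`zo_sgd_step`, `loss_fn`), not the paper's implementation:

```python
import numpy as np

def zo_sgd_step(theta, loss_fn, lr, eps, rng):
    """One ZO-SGD step with a two-point gradient estimate:
    sample z ~ N(0, I), estimate the directional derivative
    (L(theta + eps*z) - L(theta - eps*z)) / (2 * eps),
    and move theta along -z scaled by that estimate."""
    z = rng.standard_normal(theta.shape)
    proj_grad = (loss_fn(theta + eps * z) - loss_fn(theta - eps * z)) / (2 * eps)
    return theta - lr * proj_grad * z
```

Each step needs only two forward passes and no backward pass, which is what makes ZO fine-tuning memory-efficient; in MeZO-style implementations the perturbation z is regenerated from a stored RNG seed rather than kept in memory.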