Enhancing Zeroth-order Fine-tuning for Language Models with Low-rank Structures

Authors: Yiming Chen, Yuan Zhang, Liyuan Cao, Kun Yuan, Zaiwen Wen

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments across various model sizes and downstream tasks demonstrate that LOZO and its momentum-based variant outperform existing ZO methods and closely approach the performance of FO algorithms. We conduct extensive experiments across various model scales (ranging from 350M to 66B) and downstream tasks, including classification, multiple-choice, and generation. (Section 1, Contributions; Section 5, Experiments)
Researcher Affiliation | Academia | Peking University, Beijing, China; Nanjing University, Nanjing, China; AI for Science Institute, Beijing, China. Corresponding author: EMAIL
Pseudocode | Yes | Algorithm 1: Low-rank ZO-SGD (LOZO) (Page 4); Algorithm 2: Low-rank ZO-SGD with Momentum (LOZO-M) (Page 7)
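The core idea behind the paper's Low-rank ZO-SGD can be sketched generically: instead of perturbing the weights with a dense random direction, a zeroth-order step perturbs along a low-rank matrix Z = UVᵀ and uses a two-point (SPSA-style) finite-difference estimate of the directional derivative. The sketch below is an illustrative toy under these assumptions, not the authors' exact Algorithm 1 (which specifies further details, such as how the low-rank factors are resampled across steps, that are not reproduced here); all function and variable names are made up for illustration.

```python
import numpy as np

def lozo_step(W, loss_fn, lr=1e-3, eps=1e-3, rank=2, rng=None):
    """One illustrative zeroth-order SGD step with a low-rank perturbation.

    W: 2-D weight matrix; loss_fn: maps a weight matrix to a scalar loss.
    The random direction Z = U @ V.T has rank <= `rank`, so the sampled
    direction is far cheaper to store than a full dense matrix.
    """
    rng = np.random.default_rng() if rng is None else rng
    m, n = W.shape
    U = rng.standard_normal((m, rank))
    V = rng.standard_normal((n, rank))
    Z = U @ V.T  # low-rank random direction

    # Two-point finite-difference estimate of the directional
    # derivative of the loss along Z (no backpropagation needed).
    g = (loss_fn(W + eps * Z) - loss_fn(W - eps * Z)) / (2 * eps)

    # Descend along Z, scaled by the estimated directional derivative.
    return W - lr * g * Z

# Toy usage: minimize ||W - W_target||_F^2 with ZO steps only.
W_target = np.ones((4, 4))
loss = lambda W: float(np.sum((W - W_target) ** 2))
W = np.zeros((4, 4))
for _ in range(2000):
    W = lozo_step(W, loss, lr=1e-3, eps=1e-3, rank=2)
```

On this quadratic toy objective the two-point estimate is exact along Z, so each step strictly reduces the loss whenever the learning rate is small relative to ‖Z‖², which is why the loop converges without ever computing a gradient.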
Open Source Code | No | The paper does not contain an explicit statement about code release, a link to a code repository, or mention of code being provided in supplementary materials for the methodology described.
Open Datasets | Yes | We conduct experiments employing RoBERTa-large on tasks including sentiment classification, natural language inference, and topic classification. For OPT, we conduct experiments on the following datasets: SST-2, RTE, CB (De Marneffe et al., 2019), BoolQ (Clark et al., 2019), WSC (Levesque et al., 2012), WiC (Pilehvar & Camacho-Collados, 2018), MultiRC (Khashabi et al., 2018), COPA (Roemmele et al., 2011), ReCoRD (Zhang et al., 2018), SQuAD (Rajpurkar et al., 2016), DROP (Dua et al., 2019). For the LLaMA model, we evaluate its performance on the SST-2, WiC, COPA, SQuAD, and WinoGrande (Sakaguchi et al., 2021) datasets.
Dataset Splits | Yes | We adopt two settings: k = 16 and k = 512, which require 16 and 512 examples per class, respectively, during both the training and validation stages. (Appendix C.1)
Hardware Specification | Yes | Table 10: Comparison of memory costs for LOZO, MeZO, their momentum variants, and gradient-based methods on OPT-13B (LOZO: 27.0 GB on 1 A800 GPU). Table 12: Comparison of memory costs for LOZO, MeZO, and gradient-based methods on LLaMA models of varying scales for the MultiRC task with a per-device batch size of 1 (LOZO: 14.1 GB on 1 A800 GPU).
Software Dependencies | No | The paper does not provide specific software names with version numbers used for the experiments. It mentions algorithms and models like Adam, SGD, RoBERTa-large, OPT, and LLaMA, but not their specific software implementations with version numbers.
Experiment Setup | Yes | We conduct 100K training steps, evaluating the model every 10K steps, for the RoBERTa-large model; 20K training steps with evaluations every 4K steps for the OPT model; and 20K training steps, evaluating every 500 steps, for the LLaMA model. Both MeZO and LOZO utilize a constant learning rate schedule, whereas FT and FT-LoRA adopt a linear learning rate schedule. Table 4: The hyperparameter grids used for RoBERTa-large experiments. Table 5: The hyperparameter grids used for OPT experiments. Table 6: The hyperparameter grids used for LLaMA experiments.