Enhancing Zeroth-order Fine-tuning for Language Models with Low-rank Structures

Authors: Yiming Chen, Yuan Zhang, Liyuan Cao, Kun Yuan, Zaiwen Wen

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments across various model sizes and downstream tasks demonstrate that LOZO and its momentum-based variant outperform existing ZO methods and closely approach the performance of FO algorithms. We conduct extensive experiments across various model scales (ranging from 350M to 66B) and downstream tasks, including classification, multiple-choice, and generation. (Section 1, Contributions; Section 5, Experiments)
Researcher Affiliation | Academia | Peking University, Beijing, China; Nanjing University, Nanjing, China; AI for Science Institute, Beijing, China. Corresponding author: EMAIL
Pseudocode | Yes | Algorithm 1: Low-rank ZO-SGD (LOZO) (Page 4); Algorithm 2: Low-rank ZO-SGD with Momentum (LOZO-M) (Page 7)
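The core idea behind the paper's Low-rank ZO-SGD can be sketched generically: instead of perturbing the weights with a dense random direction, a zeroth-order step perturbs along a low-rank matrix Z = UVᵀ and uses a two-point (SPSA-style) finite-difference estimate of the directional derivative. The sketch below is an illustrative toy under these assumptions, not the authors' exact Algorithm 1 (which specifies further details, such as how the low-rank factors are resampled across steps, that are not reproduced here); all function and variable names are made up for illustration.

```python
import numpy as np

def lozo_step(W, loss_fn, lr=1e-3, eps=1e-3, rank=2, rng=None):
    """One illustrative zeroth-order SGD step with a low-rank perturbation.

    W: 2-D weight matrix; loss_fn: maps a weight matrix to a scalar loss.
    The random direction Z = U @ V.T has rank <= `rank`, so the sampled
    direction is far cheaper to store than a full dense matrix.
    """
    rng = np.random.default_rng() if rng is None else rng
    m, n = W.shape
    U = rng.standard_normal((m, rank))
    V = rng.standard_normal((n, rank))
    Z = U @ V.T  # low-rank random direction

    # Two-point finite-difference estimate of the directional
    # derivative of the loss along Z (no backpropagation needed).
    g = (loss_fn(W + eps * Z) - loss_fn(W - eps * Z)) / (2 * eps)

    # Descend along Z, scaled by the estimated directional derivative.
    return W - lr * g * Z

# Toy usage: minimize ||W - W_target||_F^2 with ZO steps only.
W_target = np.ones((4, 4))
loss = lambda W: float(np.sum((W - W_target) ** 2))
W = np.zeros((4, 4))
for _ in range(2000):
    W = lozo_step(W, loss, lr=1e-3, eps=1e-3, rank=2)
```

On this quadratic toy objective the two-point estimate is exact along Z, so each step strictly reduces the loss whenever the learning rate is small relative to ‖Z‖², which is why the loop converges without ever computing a gradient.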
Open Source Code | No | The paper does not contain an explicit statement about code release, a link to a code repository, or mention of code being provided in supplementary materials for the methodology described.
Open Datasets | Yes | We conduct experiments employing RoBERTa-large on tasks including sentiment classification, natural language inference, and topic classification. For OPT, we conduct experiments on the following datasets: SST-2, RTE, CB (De Marneffe et al., 2019), BoolQ (Clark et al., 2019), WSC (Levesque et al., 2012), WiC (Pilehvar & Camacho-Collados, 2018), MultiRC (Khashabi et al., 2018), COPA (Roemmele et al., 2011), ReCoRD (Zhang et al., 2018), SQuAD (Rajpurkar et al., 2016), DROP (Dua et al., 2019). For the LLaMA model, we evaluate its performance on the SST-2, WiC, COPA, SQuAD, and WinoGrande (Sakaguchi et al., 2021) datasets.
Dataset Splits | Yes | We adopt two settings: k = 16 and k = 512, which require 16 and 512 examples per class, respectively, during both the training and validation stages. (Appendix C.1)
Hardware Specification | Yes | Table 10: Comparison of memory costs for LOZO, MeZO, their momentum variants, and gradient-based methods on OPT-13B (LOZO: 27.0 GB on 1 A800 GPU). Table 12: Comparison of memory costs for LOZO, MeZO, and gradient-based methods on LLaMA models of varying scales for the MultiRC task with a per-device batch size of 1 (LOZO: 14.1 GB on 1 A800 GPU).
Software Dependencies | No | The paper does not provide specific software names with version numbers used for the experiments. It mentions algorithms and models like Adam, SGD, RoBERTa-large, OPT, and LLaMA, but not their specific software implementations with version numbers.
Experiment Setup | Yes | We conduct 100K training steps, evaluating the model every 10K steps, for the RoBERTa-large model; 20K training steps with evaluations every 4K steps for the OPT model; and 20K training steps, evaluating every 500 steps, for the LLaMA model. Both MeZO and LOZO utilize a constant learning rate schedule, whereas FT and FT-LoRA adopt a linear learning rate schedule. Table 4: The hyperparameter grids used for RoBERTa-large experiments. Table 5: The hyperparameter grids used for OPT experiments. Table 6: The hyperparameter grids used for LLaMA experiments.