Towards Efficient Low-Order Hybrid Optimizer for Language Model Fine-Tuning

Authors: Minping Chen, You-Liang Huang, Zeyi Wen

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results across common datasets on different pre-trained backbones (i.e., RoBERTa-large, OPT-13B and OPT-30B) demonstrate that LoHO can significantly improve the predictive accuracy and convergence rate of MeZO, while controlling the memory footprint during fine-tuning.
Researcher Affiliation | Academia | Minping Chen (1), You-Liang Huang (1), Zeyi Wen* (1,2); (1) The Hong Kong University of Science and Technology (Guangzhou); (2) The Hong Kong University of Science and Technology. EMAIL, EMAIL
Pseudocode | No | The paper describes the methodology using mathematical equations and textual descriptions, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures.
Open Source Code | Yes | Our code is available at https://github.com/Chan-1996/LoHO.
Open Datasets | Yes | For the RoBERTa-large experiments, we used the following datasets: SST-2 (Socher et al. 2013), RTE (Cer et al. 2017), MNLI (Williams, Nangia, and Bowman 2018) and SNLI (Bowman et al. 2015)... for the OPT experiments, we used the following datasets, including RTE (Cer et al. 2017), BoolQ (Clark et al. 2019), CB (De Marneffe, Simons, and Tonhauser 2019), MultiRC (Khashabi et al. 2018) and WIC (Pilehvar and Camacho-Collados 2018).
Dataset Splits | Yes | For the RoBERTa-large experiments, we used the following datasets:... We followed the settings of Malladi et al. (2024), which used 512 examples per class for both training and validation... for the OPT experiments,... We randomly sampled 1,000 examples for training, 500 examples for validation, and 1,000 examples for testing, which is the same as MeZO (Malladi et al. 2024).
Hardware Specification | Yes | For example, we find that when using an A800 GPU to fine-tune OPT-13B with MeZO, it exhibits over 10GB of free memory... Memory budget: a single RTX 4090 GPU with 24GB memory... Memory budget: a single A800 GPU with 80GB memory.
Software Dependencies | No | The paper mentions several optimizers (e.g., Adam, AdamW, SGD, MeZO) and a 'sparse operations library' but does not provide specific version numbers for any of these software components or the underlying deep learning framework.
Experiment Setup | Yes | Another question is how to set the ratio of parameters to be updated by the FO optimizer in each layer... the learning rate of the ZO optimizer can be configured to be several orders of magnitude lower than that of the FO optimizer... there is a perturbation scale ϵ in the gradient estimation function (cf. Equation 1) which is commonly set to a value much smaller than one (e.g., 0.01 or 0.001) (Malladi et al. 2024)... For example, for the OPT-30B model, the maximum number of FO layers is four using a single A800 GPU... bz=64 (RoBERTa), bz=16 (OPT-13B), bz=8 (OPT-30B).
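The Experiment Setup row mentions a perturbation scale ϵ in the paper's gradient estimation function (Equation 1), following MeZO (Malladi et al. 2024). The sketch below illustrates the general idea of a MeZO-style two-point (SPSA) zeroth-order gradient estimate; it is a minimal NumPy illustration under our own assumptions (function and variable names are ours), not the paper's implementation:

```python
import numpy as np

def spsa_grad_estimate(loss_fn, theta, eps=1e-3, seed=0):
    """Two-point zeroth-order (SPSA-style) gradient estimate.

    Perturbs the parameters along a shared Gaussian direction z and
    estimates the gradient from two forward passes only:
        g_hat = (L(theta + eps*z) - L(theta - eps*z)) / (2*eps) * z
    `eps` is the perturbation scale, commonly set much smaller than one
    (e.g., 0.01 or 0.001) per the quoted setup.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(theta.shape)          # shared random direction
    loss_plus = loss_fn(theta + eps * z)          # forward pass 1
    loss_minus = loss_fn(theta - eps * z)         # forward pass 2
    scalar = (loss_plus - loss_minus) / (2 * eps) # directional derivative estimate
    return scalar * z                             # projected back onto z

# Toy usage: quadratic loss sum(theta^2), whose true gradient is 2*theta.
theta = np.array([1.0, -2.0, 0.5])
g = spsa_grad_estimate(lambda t: float(np.sum(t ** 2)), theta)
```

Because the estimate is `(z . grad) * z`, its inner product with the true gradient is non-negative in expectation, which is why such estimates can drive descent despite using no backward pass.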
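The Dataset Splits row quotes a random-sampling protocol for the OPT experiments (1,000 train / 500 validation / 1,000 test, as in MeZO). A minimal sketch of such disjoint random sampling, with the function name and structure assumed by us rather than taken from the paper's code:

```python
import random

def sample_splits(dataset, n_train=1000, n_val=500, n_test=1000, seed=42):
    """Randomly sample disjoint train/val/test subsets.

    `dataset` is any indexable sequence of examples; the default sizes
    mirror the quoted OPT setup (1,000 / 500 / 1,000).
    """
    rng = random.Random(seed)
    idx = list(range(len(dataset)))
    rng.shuffle(idx)                                  # one shuffle, then slice
    train = [dataset[i] for i in idx[:n_train]]
    val = [dataset[i] for i in idx[n_train:n_train + n_val]]
    test = [dataset[i] for i in idx[n_train + n_val:n_train + n_val + n_test]]
    return train, val, test

# Toy usage on a synthetic dataset of 3,000 examples.
data = list(range(3000))
train, val, test = sample_splits(data)
```

Slicing one shuffled index list guarantees the three subsets are disjoint, which a per-split independent sample would not.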