FRUGAL: Memory-Efficient Optimization by Reducing State Overhead for Scalable Training

Authors: Philip Zmushko, Aleksandr Beznosikov, Martin Takáč, Samuel Horváth

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "To verify the practical applicability of FRUGAL, we conduct extensive experiments in popular real-world scenarios. In these experiments, we pre-train LLaMA-like models (up to 1B parameters) on the Colossal Clean Crawled Corpus (C4) dataset (Raffel et al., 2020) and fine-tune RoBERTa (Liu, 2019) on the GLUE benchmark (Wang, 2018). The results show that our method significantly outperforms previous memory-efficient algorithms while using less memory budget."
Researcher Affiliation | Collaboration | 1Yandex, Russia; 2Moscow Institute of Physics and Technology, Russia; 3Ivannikov Institute for System Programming RAS, Russia; 4Skolkovo Institute of Science and Technology, Russia; 5Mohamed bin Zayed University of Artificial Intelligence, UAE. Correspondence to: Philip Zmushko <EMAIL>.
Pseudocode | Yes | "Algorithm 1 FRUGAL (State-Full, State-Free). Input: model f_θ with p parameter sets {θ_i ∈ R^{d_i}}_{i=1}^p, loss L, gradient projectors {P_{k,i}}_{i=1}^p, number of steps K... Algorithm 2 FRUGAL (SGDM, SGD)... Algorithm 4 FRUGAL step pseudocode, PyTorch-like... Algorithm 5 Examples of state-full and state-free steps for Algorithm 4"
Open Source Code | Yes | "The code is available at https://anonymous.4open.science/r/FRUGAL-D3CA."
Open Datasets | Yes | "In these experiments, we pre-train LLaMA-like models (up to 1B parameters) on the Colossal Clean Crawled Corpus (C4) dataset (Raffel et al., 2020) and fine-tune RoBERTa (Liu, 2019) on the GLUE benchmark (Wang, 2018)."
Dataset Splits | Yes | "To verify the practical applicability of FRUGAL, we conduct extensive experiments in popular real-world scenarios. In these experiments, we pre-train LLaMA-like models (up to 1B parameters) on the Colossal Clean Crawled Corpus (C4) dataset (Raffel et al., 2020) and fine-tune RoBERTa (Liu, 2019) on the GLUE benchmark (Wang, 2018)... We evaluated the performance of our framework in memory-efficient fine-tuning using the GLUE benchmark (Wang, 2018), a widely-used collection of tasks for evaluating language models... Following the experimental protocol from Hu et al. (2023), we apply memory-efficient methods to the same parameter subsets: the Q, K, V, Up, and Down projection matrices. We used the same hyperparameter configuration as in the original work."
Hardware Specification | No | No specific hardware details for running the experiments were provided in the paper. The mention of 'A100-80GB' was in the context of memory requirements for large models, not as hardware used by the authors.
Software Dependencies | No | The paper implies the use of PyTorch through the pseudocode section 'Algorithm 4 FRUGAL step pseudocode, PyTorch-like', but it does not specify version numbers for PyTorch or any other software libraries.
Experiment Setup | Yes | "The core setup for pre-training is taken from Zhao et al. (2024a). We utilize LLaMA-based (Touvron et al., 2023a) model architectures and train them on the Colossal Clean Crawled Corpus (C4) dataset (Raffel et al., 2020). The C4 dataset is intended for pre-training, making this setup a good approximation of real-world applications. A detailed description of the setup can be found in Appendix A.1... We used standard Adam hyperparameters: β1 = 0.9, β2 = 0.999, ε = 1e-8. For all methods except GaLore, we selected the learning rate equal to the optimal learning rate for Adam, which we determined through a grid search among the values [1e-4, 3e-4, 1e-3, 3e-3]. FRUGAL's learning rate for the state-free optimizer was set equal to that for the state-full optimizer for simplicity and ease of tuning."
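The core idea behind the pseudocode row above (Algorithms 1 and 4) is that FRUGAL splits each update into a state-full part, handled by an optimizer with per-coordinate state such as Adam, and a state-free part, handled by a stateless update such as plain SGD. The toy sketch below illustrates this split on a flat parameter list, with Adam state kept only for a chosen index subset; all names (`frugal_step`, `full_idx`) are hypothetical and this is an illustration of the idea, not the authors' code.

```python
import math

def frugal_step(params, grads, full_idx, state, lr,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """One FRUGAL-style step (illustrative sketch, not the paper's code).

    Coordinates listed in `full_idx` receive a state-full Adam update;
    every other coordinate receives a state-free plain-SGD update, so
    optimizer state (m, v) is stored only for the state-full subset.
    """
    m, v, t = state
    t += 1
    new_params = list(params)
    # State-full block: Adam with bias correction on the selected subset.
    for j, i in enumerate(full_idx):
        g = grads[i]
        m[j] = beta1 * m[j] + (1 - beta1) * g
        v[j] = beta2 * v[j] + (1 - beta2) * g * g
        m_hat = m[j] / (1 - beta1 ** t)
        v_hat = v[j] / (1 - beta2 ** t)
        new_params[i] = params[i] - lr * m_hat / (math.sqrt(v_hat) + eps)
    # State-free block: plain SGD, no per-coordinate state retained.
    for i in range(len(params)):
        if i not in full_idx:
            new_params[i] = params[i] - lr * grads[i]
    return new_params, (m, v, t)
```

The memory saving comes from `m` and `v` having only `len(full_idx)` entries instead of one per parameter; the paper's Algorithm 1 additionally re-draws the state-full subspace via gradient projectors at each step.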
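The hyperparameter protocol in the setup row above (fixed Adam moments, learning rate chosen by grid search, and the state-free learning rate tied to the state-full one) can be expressed as a small configuration sketch. The helper name `pick_lr` and the `eval_loss` callback are hypothetical; this only mirrors the selection rule described in the quote.

```python
# Hyperparameters quoted in the paper's setup description.
ADAM_HPARAMS = {"beta1": 0.9, "beta2": 0.999, "eps": 1e-8}
LR_GRID = [1e-4, 3e-4, 1e-3, 3e-3]

def pick_lr(eval_loss, grid=LR_GRID):
    """Return the grid learning rate with the lowest evaluation loss.

    Per the paper's protocol, the selected value is then reused for
    both the state-full and the state-free optimizer in FRUGAL.
    """
    return min(grid, key=eval_loss)
```

Tying both learning rates together removes one tuning axis, which is part of why the authors describe the scheme as easy to tune.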