Low-rank Momentum Factorization for Memory Efficient Training

Authors: Pouria Mahdavinia, Mehrdad Mahdavi

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate MoFaSGD's effectiveness and efficiency across three large language modeling setups: pretraining, natural language understanding (NLU) fine-tuning, and instruction-tuning.
Researcher Affiliation | Academia | Pouria Mahdavinia (EMAIL), Department of Computer Science and Engineering, The Pennsylvania State University; Mehrdad Mahdavi (EMAIL), Department of Computer Science and Engineering, The Pennsylvania State University
Pseudocode | Yes | Algorithm 1 MoFaSGD: Momentum Factorized Stochastic Gradient Descent
Open Source Code | Yes | Our implementation is available at https://github.com/pmahdavi/MoFaSGD.
Open Datasets | Yes | FineWeb dataset (Penedo et al., 2025)... GLUE benchmark (Wang, 2018)... tulu-3-sft-mixture dataset (Lambert et al., 2024).
Dataset Splits | Yes | measures performance using validation perplexity on a held-out partition of FineWeb... We used 5% of the sampled dataset for validation.
Hardware Specification | Yes | All experiments were conducted on NVIDIA A100 GPUs.
Software Dependencies | No | Experiments were implemented using standard libraries for deep learning, including PyTorch, Hugging Face Transformers, and Accelerate. Specific library versions are detailed in the code repository.
Experiment Setup | Yes | Key hyperparameters selected for the NanoGPT pre-training experiments are summarized in Table 5. ... Learning rates for AdamW, GaLore, and MoFaSGD were tuned via grid search over {1e-4, 2e-4, 3e-4, 5e-4, 8e-4, 1e-3, 3e-3, 5e-3, 8e-3, 1e-2, 2e-2, 5e-2}. MoFaSGD's momentum decay β was tuned over {0.5, 0.85, 0.90, 0.95}. GaLore's SVD frequency was tuned over {10, 25, 75, 150, 300}.
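The report notes that the paper provides pseudocode (Algorithm 1, MoFaSGD). That algorithm is not reproduced here; purely as an illustration of the general idea of momentum factorization, the sketch below keeps the momentum of a matrix-shaped parameter as low-rank factors U, V and re-truncates via SVD after each update. This is a hypothetical sketch in PyTorch (which the paper's code uses), not the authors' Algorithm 1; the function name and update details are assumptions.

```python
import torch

def low_rank_momentum_step(param, grad, U, V, lr=1e-3, beta=0.9, rank=4):
    """Illustrative low-rank momentum update -- NOT the paper's Algorithm 1.

    The momentum of an (m, n) parameter is stored as factors
    U (m, rank) and V (n, rank), so memory is O((m + n) * rank)
    instead of O(m * n).
    """
    # Reconstruct momentum, blend in the new gradient (EMA with decay beta).
    momentum = beta * (U @ V.T) + (1 - beta) * grad
    # Re-factorize: keep only the top-`rank` singular directions.
    U_full, S, Vh = torch.linalg.svd(momentum, full_matrices=False)
    U = U_full[:, :rank] * S[:rank]   # fold singular values into U
    V = Vh[:rank, :].T
    # Descend along the low-rank momentum estimate.
    param -= lr * (U @ V.T)
    return param, U, V
```

A full-rank SVD per step, as here, is only for clarity; a practical implementation would amortize or approximate the factorization.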
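The grid search described in the Experiment Setup row (learning rate crossed with momentum decay β) can be sketched as a simple exhaustive loop. `train_and_eval` below is a placeholder objective, not the paper's training run; a real search would launch a short NanoGPT training run per configuration and return validation perplexity.

```python
import itertools

# Grids quoted from the report (learning rate and MoFaSGD momentum decay).
lrs = [1e-4, 2e-4, 3e-4, 5e-4, 8e-4, 1e-3, 3e-3, 5e-3, 8e-3, 1e-2, 2e-2, 5e-2]
betas = [0.5, 0.85, 0.90, 0.95]

def train_and_eval(lr, beta):
    # Placeholder: stands in for training with (lr, beta) and
    # returning a validation metric to minimize.
    return (lr - 1e-3) ** 2 + (beta - 0.9) ** 2

# Exhaustively evaluate all lr x beta combinations, keep the best.
best_lr, best_beta = min(itertools.product(lrs, betas),
                         key=lambda cfg: train_and_eval(*cfg))
```

GaLore's SVD frequency grid from the same row could be searched the same way as a third axis of `itertools.product`.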