Low-rank Momentum Factorization for Memory Efficient Training
Authors: Pouria Mahdavinia, Mehrdad Mahdavi
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate MoFaSGD's effectiveness and efficiency across three large language modeling setups: pretraining, natural language understanding (NLU) fine-tuning, and instruction-tuning. |
| Researcher Affiliation | Academia | Pouria Mahdavinia EMAIL Department of Computer Science and Engineering The Pennsylvania State University Mehrdad Mahdavi EMAIL Department of Computer Science and Engineering The Pennsylvania State University |
| Pseudocode | Yes | Algorithm 1 MoFaSGD: Momentum Factorized Stochastic Gradient Descent |
| Open Source Code | Yes | Our implementation is available at https://github.com/pmahdavi/MoFaSGD. |
| Open Datasets | Yes | FineWeb dataset (Penedo et al., 2025)... GLUE benchmark (Wang, 2018)... tulu-3-sft-mixture dataset (Lambert et al., 2024). |
| Dataset Splits | Yes | measures performance using validation perplexity on a held-out partition of FineWeb... We used 5% of the sampled dataset for validation. |
| Hardware Specification | Yes | All experiments were conducted on NVIDIA A100 GPUs. |
| Software Dependencies | No | Experiments were implemented using standard libraries for deep learning, including PyTorch, Hugging Face Transformers, and Accelerate. Specific library versions are detailed in the code repository. |
| Experiment Setup | Yes | Key hyperparameters selected for the NanoGPT pre-training experiments are summarized in Table 5. ... Learning rates for AdamW, GaLore, and MoFaSGD were tuned via grid search over {1e-4, 2e-4, 3e-4, 5e-4, 8e-4, 1e-3, 3e-3, 5e-3, 8e-3, 1e-2, 2e-2, 5e-2}. MoFaSGD's momentum decay β was tuned over {0.5, 0.85, 0.90, 0.95}. GaLore's SVD frequency was tuned over {10, 25, 75, 150, 300}. |
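The checklist above points to Algorithm 1 (MoFaSGD) in the paper; the authoritative implementation is in the linked repository. As a loose illustration of the general idea, keeping the momentum buffer in a low-rank factorized form instead of materializing the full momentum matrix, here is a minimal sketch. The function name, the full-SVD re-truncation step, and the default hyperparameters are assumptions for illustration, not the paper's actual update rule.

```python
import numpy as np

def lowrank_momentum_step(W, grad, U, V, lr=1e-3, beta=0.9, rank=4):
    """One illustrative SGD step with momentum stored as a rank-`rank`
    factorization U @ V.T (hypothetical sketch, not the paper's MoFaSGD).

    W:    parameter matrix, shape (m, n)
    grad: gradient of the loss w.r.t. W, shape (m, n)
    U, V: momentum factors, shapes (m, rank) and (n, rank)
    """
    # EMA momentum update: reconstruct, mix in the new gradient...
    M = beta * (U @ V.T) + (1 - beta) * grad
    # ...then re-truncate back to rank `rank` (a real method would
    # avoid this full SVD; it is used here only for clarity).
    P, S, Qt = np.linalg.svd(M, full_matrices=False)
    U = P[:, :rank] * S[:rank]   # absorb singular values into U
    V = Qt[:rank, :].T
    # Descend along the low-rank momentum estimate.
    W = W - lr * (U @ V.T)
    return W, U, V
```

Initializing `U` and `V` to zeros and calling this once per minibatch keeps only `(m + n) * rank` momentum entries instead of `m * n`, which is the memory saving the factorization targets.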