Low-rank Momentum Factorization for Memory Efficient Training

Authors: Pouria Mahdavinia, Mehrdad Mahdavi

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate MoFaSGD's effectiveness and efficiency across three large language modeling setups: pretraining, natural language understanding (NLU) fine-tuning, and instruction-tuning.
Researcher Affiliation | Academia | Pouria Mahdavinia (EMAIL), Department of Computer Science and Engineering, The Pennsylvania State University; Mehrdad Mahdavi (EMAIL), Department of Computer Science and Engineering, The Pennsylvania State University
Pseudocode | Yes | Algorithm 1 MoFaSGD: Momentum Factorized Stochastic Gradient Descent
Open Source Code | Yes | Our implementation is available at https://github.com/pmahdavi/MoFaSGD.
Open Datasets | Yes | FineWeb dataset (Penedo et al., 2025)... GLUE benchmark (Wang, 2018)... tulu-3-sft-mixture dataset (Lambert et al., 2024).
Dataset Splits | Yes | measures performance using validation perplexity on a held-out partition of FineWeb... We used 5% of the sampled dataset for validation.
Hardware Specification | Yes | All experiments were conducted on NVIDIA A100 GPUs.
Software Dependencies | No | Experiments were implemented using standard libraries for deep learning, including PyTorch, Hugging Face Transformers, and Accelerate. Specific library versions are detailed in the code repository.
Experiment Setup | Yes | Key hyperparameters selected for the NanoGPT pre-training experiments are summarized in Table 5. ... Learning rates for AdamW, GaLore, and MoFaSGD were tuned via grid search over {1e-4, 2e-4, 3e-4, 5e-4, 8e-4, 1e-3, 3e-3, 5e-3, 8e-3, 1e-2, 2e-2, 5e-2}. MoFaSGD's momentum decay β was tuned over {0.5, 0.85, 0.90, 0.95}. GaLore's SVD frequency was tuned over {10, 25, 75, 150, 300}.
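The report notes that the paper provides pseudocode (Algorithm 1, MoFaSGD). That algorithm is not reproduced here; purely as an illustration of the general idea of momentum factorization, the sketch below keeps the momentum of a matrix-shaped parameter as low-rank factors U, V and re-truncates via SVD after each update. This is a hypothetical sketch in PyTorch (which the paper's code uses), not the authors' Algorithm 1; the function name and update details are assumptions.

```python
import torch

def low_rank_momentum_step(param, grad, U, V, lr=1e-3, beta=0.9, rank=4):
    """Illustrative low-rank momentum update -- NOT the paper's Algorithm 1.

    The momentum of an (m, n) parameter is stored as factors
    U (m, rank) and V (n, rank), so memory is O((m + n) * rank)
    instead of O(m * n).
    """
    # Reconstruct momentum, blend in the new gradient (EMA with decay beta).
    momentum = beta * (U @ V.T) + (1 - beta) * grad
    # Re-factorize: keep only the top-`rank` singular directions.
    U_full, S, Vh = torch.linalg.svd(momentum, full_matrices=False)
    U = U_full[:, :rank] * S[:rank]   # fold singular values into U
    V = Vh[:rank, :].T
    # Descend along the low-rank momentum estimate.
    param -= lr * (U @ V.T)
    return param, U, V
```

A full-rank SVD per step, as here, is only for clarity; a practical implementation would amortize or approximate the factorization.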
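The grid search described in the Experiment Setup row (learning rate crossed with momentum decay β) can be sketched as a simple exhaustive loop. `train_and_eval` below is a placeholder objective, not the paper's training run; a real search would launch a short NanoGPT training run per configuration and return validation perplexity.

```python
import itertools

# Grids quoted from the report (learning rate and MoFaSGD momentum decay).
lrs = [1e-4, 2e-4, 3e-4, 5e-4, 8e-4, 1e-3, 3e-3, 5e-3, 8e-3, 1e-2, 2e-2, 5e-2]
betas = [0.5, 0.85, 0.90, 0.95]

def train_and_eval(lr, beta):
    # Placeholder: stands in for training with (lr, beta) and
    # returning a validation metric to minimize.
    return (lr - 1e-3) ** 2 + (beta - 0.9) ** 2

# Exhaustively evaluate all lr x beta combinations, keep the best.
best_lr, best_beta = min(itertools.product(lrs, betas),
                         key=lambda cfg: train_and_eval(*cfg))
```

GaLore's SVD frequency grid from the same row could be searched the same way as a third axis of `itertools.product`.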