MKOR: Momentum-Enabled Kronecker-Factor-Based Optimizer Using Rank-1 Updates
Authors: Mohammad Mozaffari, Sikan Li, Zhao Zhang, Maryam Mehri Dehnavi
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that MKOR outperforms state-of-the-art first-order methods, e.g. the LAMB optimizer, and best implementations of second-order methods, i.e. KAISA/KFAC, up to 2.57× and 1.85× respectively on BERT-Large-Uncased on 64 GPUs. |
| Researcher Affiliation | Academia | Mohammad Mozaffari, Department of Computer Science, University of Toronto, EMAIL; Sikan Li, Texas Advanced Computing Center, EMAIL; Zhao Zhang, Department of Electrical and Computer Engineering, Rutgers University, EMAIL; Maryam Mehri Dehnavi, Department of Computer Science, University of Toronto, EMAIL |
| Pseudocode | Yes | Algorithm 1: MKOR Algorithm for a Single Layer m |
| Open Source Code | Yes | Our code base is publicly available on https://github.com/Mohammad-Mozaffari/mkor, and the instructions for running each experiment are available there. |
| Open Datasets | Yes | For the pre-training process, the English Wikipedia [30] and the Toronto Book Corpus [34] datasets, which were used in the original BERT pre-training, are used; the latter dataset is not fully available, which results in a small reduction in the baseline accuracies achieved in our experiments relative to the original BERT results. Following [21], due to the time-intensive process of hyperparameter tuning for the first phase of pre-training, we report the effectiveness of MKOR in the second phase of pre-training only, while using the checkpoints of the first phase generated using the LAMB optimizer. |
| Dataset Splits | Yes | AlexNet [12] with more than 20M parameters on CIFAR-100 [11], consisting of 50K training and 10K validation images of 100 classes. |
| Hardware Specification | Yes | MKOR outperforms state-of-the-art first-order methods, e.g. the LAMB optimizer, and best implementations of second-order methods, i.e. KAISA/KFAC, up to 2.57× and 1.85× respectively on BERT-Large-Uncased on 64 GPUs. (...) For the BERT-Large-Uncased pre-training and fine-tuning experiments, we have used up to 64 A100 GPUs on the Polaris [3] cluster, which has 560 nodes, each with 4 NVIDIA A100 GPUs with NVLink interconnects. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as libraries or frameworks (e.g., PyTorch version, TensorFlow version, CUDA version). |
| Experiment Setup | Yes | For the BERT-Large-Uncased pre-training, we use the same hyperparameters used in [18]. The factors in KAISA are updated every 50 iterations, and the factors in MKOR and MKOR-H are updated every 10 iterations. (...) For SGD and KAISA, we use the same hyperparameters used in [21]. The factors in MKOR are updated every 10 iterations, and the learning rate used there is the same as KAISA. The learning rate in MKOR decays by a factor of 2 at the end of epochs 25, 35, 40, 45, 50, 55, and 56. |
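The paper's title and Algorithm 1 refer to maintaining inverse Kronecker factors via rank-1 updates. As an illustrative sketch only (not the paper's implementation, which should be consulted at the repository linked above), a rank-1 correction to a matrix inverse can be applied in O(n²) time with the Sherman–Morrison identity, avoiding a full O(n³) re-inversion:

```python
import numpy as np

def sherman_morrison_update(A_inv, u, v):
    """Given A^{-1}, return (A + u v^T)^{-1} via the Sherman-Morrison identity.

    (A + u v^T)^{-1} = A^{-1} - (A^{-1} u)(v^T A^{-1}) / (1 + v^T A^{-1} u)
    """
    Au = A_inv @ u            # A^{-1} u, shape (n,)
    vA = v @ A_inv            # v^T A^{-1}, shape (n,)
    denom = 1.0 + v @ Au      # scalar; must be nonzero for the update to exist
    return A_inv - np.outer(Au, vA) / denom

# Usage: update the inverse of a well-conditioned factor after a rank-1 change.
rng = np.random.default_rng(0)
n = 4
M = np.eye(n) + 0.1 * rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)       # symmetric positive definite factor
A_inv = np.linalg.inv(A)
u = rng.standard_normal(n)

updated_inv = sherman_morrison_update(A_inv, u, u)   # rank-1 update in O(n^2)
direct_inv = np.linalg.inv(A + np.outer(u, u))        # reference: full inversion
assert np.allclose(updated_inv, direct_inv)
```

The comparison against a direct re-inversion shows why this family of updates is attractive between full factor refreshes (the table above notes MKOR refreshes its factors only every 10 iterations).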