LDAdam: Adaptive Optimization from Low-Dimensional Gradient Statistics

Authors: Thomas Robert, Mher Safaryan, Ionut-Vlad Modoranu, Dan Alistarh

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically evaluate LDAdam for fine-tuning and pre-training large language models against baseline Adam (Kingma & Ba, 2017) and the memory-efficient GaLore (Zhao et al., 2024) (Algorithm 2). We apply LDAdam for fine-tuning RoBERTa (Liu et al., 2019) and Llama-family (Touvron et al., 2023) models on the GLUE (Wang et al., 2018) and Grade-School Math (GSM) (Cobbe et al., 2021) benchmarks, respectively.
Researcher Affiliation | Academia | Thomas Robert (1), Mher Safaryan (2), Ionut-Vlad Modoranu (2), Dan Alistarh (2). (1) Institut Polytechnique de Paris (IPP); (2) Institute of Science and Technology Austria (ISTA). Correspondence to EMAIL.
Pseudocode | Yes | Algorithm 1 LDAdam (Practical View only: g_t ∈ ℝ^{n×m} / Analytical View only: g_t ∈ ℝ^d)
Open Source Code | Yes | Code is available at https://github.com/IST-DASLab/LDAdam.
Open Datasets | Yes | We apply LDAdam for fine-tuning RoBERTa (Liu et al., 2019) and Llama-family (Touvron et al., 2023) models on the GLUE (Wang et al., 2018) and Grade-School Math (GSM) (Cobbe et al., 2021) benchmarks, respectively. [...] We evaluate LDAdam for pre-training Llama models (Touvron et al., 2023) on the C4 dataset (Raffel et al., 2023).
Dataset Splits | No | The paper uses the GLUE, GSM8K, and C4 datasets and benchmarks, and references standard training parameters such as epochs, batch size, and sequence length. However, it does not explicitly describe how these datasets were split into training, validation, and test sets (e.g., percentages, sample counts, or an explicit reference to standard splits used for reproducibility).
Hardware Specification | Yes | Table 6 reports peak memory for fine-tuning and pre-training on a single NVIDIA H100 80GB GPU with micro-batch size 1 and without activation checkpointing.
Software Dependencies | No | The paper mentions a 'PyTorch implementation' and provides examples using PyTorch functions. However, it does not specify version numbers for PyTorch or any other software library or dependency used in the experiments.
Experiment Setup | Yes | Tables 10, 11, and 12 detail all the hyperparameters used for fine-tuning the RoBERTa-base model on the GLUE benchmark, fine-tuning the Llama-2 7B model on the GSM8K dataset, and pre-training Llama models on the C4 dataset, respectively. These include epochs, batch size, learning rate, decay rates β1 and β2, weight decay, dropout, gradient clipping, etc.
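For intuition about the "low-dimensional gradient statistics" idea behind the algorithm signature above (g_t ∈ ℝ^{n×m} projected to a rank-r subspace), here is a minimal sketch of an Adam-style step whose moment estimates live in a projected low-dimensional space, in the spirit of projection-based optimizers such as GaLore. This is an illustration under assumed conventions, not the exact LDAdam algorithm: the function name `low_dim_adam_step`, the fixed orthonormal projection `P`, and all hyperparameter defaults are hypothetical, and LDAdam's defining features (adaptively updating the projection and correcting optimizer states across subspace changes) are omitted.

```python
import numpy as np

def low_dim_adam_step(g, P, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style step with moments kept in a rank-r subspace.

    Illustrative sketch only (NOT the paper's exact algorithm): project the
    full gradient g (shape (n, m)) through P (shape (n, r)), update Adam
    first/second moments of shape (r, m), then map the update back.
    """
    c = P.T @ g                        # low-dimensional gradient, shape (r, m)
    m = beta1 * m + (1 - beta1) * c    # first moment, stored in the subspace
    v = beta2 * v + (1 - beta2) * c**2 # second moment, stored in the subspace
    m_hat = m / (1 - beta1**t)         # standard Adam bias correction
    v_hat = v / (1 - beta2**t)
    update = P @ (m_hat / (np.sqrt(v_hat) + eps))  # back to full (n, m) space
    return -lr * update, m, v

# Usage: rank-4 subspace for a 64x32 gradient matrix
rng = np.random.default_rng(0)
n, m_dim, r = 64, 32, 4
P, _ = np.linalg.qr(rng.standard_normal((n, r)))   # orthonormal basis
g = rng.standard_normal((n, m_dim))
step, m1, v1 = low_dim_adam_step(
    g, P, np.zeros((r, m_dim)), np.zeros((r, m_dim)), t=1
)
print(step.shape)  # (64, 32)
```

The memory saving comes from storing the two moment buffers at shape (r, m) instead of (n, m); with r « n this is the same accounting that motivates the peak-memory comparison in Table 6.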