LDAdam: Adaptive Optimization from Low-Dimensional Gradient Statistics
Authors: Thomas Robert, Mher Safaryan, Ionut-Vlad Modoranu, Dan Alistarh
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically evaluate LDAdam for fine-tuning and pre-training large language models against baseline Adam (Kingma & Ba, 2017) and the memory-efficient GaLore (Zhao et al., 2024) (Algorithm 2). We apply LDAdam for fine-tuning RoBERTa (Liu et al., 2019) and Llama-family (Touvron et al., 2023) models on the GLUE (Wang et al., 2018) and Grade-School Math (GSM) (Cobbe et al., 2021) benchmarks, respectively. |
| Researcher Affiliation | Academia | Thomas Robert¹, Mher Safaryan², Ionut-Vlad Modoranu², Dan Alistarh² — ¹Institut Polytechnique de Paris (IPP), ²Institute of Science and Technology Austria (ISTA). Correspondence to EMAIL. |
| Pseudocode | Yes | Algorithm 1 LDAdam (Practical View Only, g_t ∈ R^{n×m} / Analytical View Only, g_t ∈ R^d) |
| Open Source Code | Yes | Code is available at https://github.com/IST-DASLab/LDAdam. |
| Open Datasets | Yes | We apply LDAdam for fine-tuning RoBERTa (Liu et al., 2019) and Llama-family (Touvron et al., 2023) models on the GLUE (Wang et al., 2018) and Grade-School Math (GSM) (Cobbe et al., 2021) benchmarks, respectively. [...] We evaluate LDAdam for pre-training Llama models (Touvron et al., 2023) on the C4 dataset (Raffel et al., 2023). |
| Dataset Splits | No | The paper mentions using the GLUE, GSM8K, and C4 datasets and benchmarks, and references standard training parameters such as epochs, batch size, and sequence length. However, it does not explicitly describe how these datasets were split into training, validation, and test sets (e.g., percentages, sample counts, or an explicit reference to the standard splits needed for reproducibility). |
| Hardware Specification | Yes | Table 6 reports peak memory for fine-tuning and pre-training on a single NVIDIA H100 80GB GPU with micro batch size 1 and without activation checkpointing. |
| Software Dependencies | No | The paper mentions a 'PyTorch implementation' and provides examples using PyTorch functions. However, it does not specify version numbers for PyTorch or any other software libraries or dependencies used in the experiments. |
| Experiment Setup | Yes | Tables 10, 11, and 12 detail all the hyperparameters we use when fine-tuning respectively the RoBERTa-base model on the GLUE benchmark and the Llama-2 7B model on the GSM8K dataset, and when pre-training Llama models on the C4 dataset. These include Epochs, Batch Size, Learning Rate, Decay Rate β1, Decay Rate β2, Weight Decay, Dropout, Gradient Clipping, etc. |
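To make the "low-dimensional gradient statistics" idea concrete, the following is a minimal illustrative sketch of an Adam-style step whose moment estimates live in a rank-r projected subspace rather than in the full parameter space. This is an assumption-based toy (the function name, shapes, and fixed projection `P` are hypothetical), not the paper's actual LDAdam algorithm, which additionally handles projection-aware error feedback and adaptive subspace updates.

```python
import numpy as np

def low_rank_adam_step(W, G, P, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam-style update keeping optimizer statistics in a rank-r
    subspace. Illustrative sketch only, not the exact LDAdam method.

    W: (n, k) parameter matrix      G: (n, k) gradient
    P: (n, r) orthonormal projection onto the low-dim subspace
    m, v: (r, k) first/second moment estimates kept in low dimension
    t: 1-based step counter for bias correction
    """
    g = P.T @ G                              # project gradient: (r, k)
    m = b1 * m + (1 - b1) * g                # low-dim first-moment estimate
    v = b2 * v + (1 - b2) * g ** 2           # low-dim second-moment estimate
    m_hat = m / (1 - b1 ** t)                # standard Adam bias correction
    v_hat = v / (1 - b2 ** t)
    update = P @ (m_hat / (np.sqrt(v_hat) + eps))  # map step back to full space
    return W - lr * update, m, v
```

The memory saving comes from `m` and `v` being (r, k) instead of (n, k); for r much smaller than n, optimizer state shrinks roughly by the factor r/n.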