Clipping Improves Adam-Norm and AdaGrad-Norm when the Noise Is Heavy-Tailed

Authors: Savelii Chezhegov, Yaroslav Klyukin, Andrei Semenov, Aleksandr Beznosikov, Alexander Gasnikov, Samuel Horváth, Martin Takáč, Eduard Gorbunov

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our empirical evaluations highlight the superiority of clipped versions of AdaGrad/Adam in handling the heavy-tailed noise. ... Numerical experiments. We conducted numerical experiments for synthetic and real-world problems. More precisely, we illustrate the superiority of different versions of Adam/AdaGrad with clipping to the non-clipped versions of Adam/AdaGrad on a simple quadratic problem with additive heavy-tailed noise in the gradients. Next, we also test Adam with and without clipping on the fine-tuning of ALBERT Base model (Lan et al., 2019) on CoLA and RTE datasets (Wang et al., 2018) and observe that Adam with clipping significantly outperforms Adam without clipping when the noise is heavy-tailed. We also obtain similar results for the fine-tuning of RoBERTa-Large model (Liu et al., 2019).
Researcher Affiliation Collaboration 1Moscow Institute of Physics and Technology, Russia 2Ivannikov Institute for System Programming RAS, Russia 3Sber AI Lab, Russia 4Machine Learning and Optimization Laboratory (MLO), EPFL, Lausanne, Switzerland 5Innopolis University, Russia 6The Russian Presidential Academy of National Economy and Public Administration, Russia 7Skolkovo Institute of Science and Technology, Russia 8Mohamed bin Zayed University of Artificial Intelligence, UAE. Correspondence to: Eduard Gorbunov <EMAIL>.
Pseudocode Yes Algorithm 1 Adam-norm/AdamD-norm and M-AdaGrad-norm/M-AdaGradD-norm; Algorithm 2 Clip-Adam-norm/Clip-AdamD-norm and Clip-M-AdaGrad-norm/Clip-M-AdaGradD-norm
Open Source Code Yes Our code is available online: https://github.com/yaroslavkliukin/Clipped-Ada-Grad-and-Adam.
Open Datasets Yes Next, we also test Adam with and without clipping on the fine-tuning of ALBERT Base model (Lan et al., 2019) on CoLA and RTE datasets (Wang et al., 2018) and observe that Adam with clipping significantly outperforms Adam without clipping when the noise is heavy-tailed. ... In Appendix D.3, we also provide additional experiments with the fine-tuning of the 355M parameter RoBERTa-Large model (Liu et al., 2019) on two GLUE (Wang et al., 2018) datasets: QNLI (116k question-answer pairs) and CoLA (10.7k linguistic acceptability examples).
Dataset Splits Yes Next, we also test Adam with and without clipping on the fine-tuning of ALBERT Base model (Lan et al., 2019) on CoLA and RTE datasets (Wang et al., 2018) and observe that Adam with clipping significantly outperforms Adam without clipping when the noise is heavy-tailed. ... In Appendix D.3, we also provide additional experiments with the fine-tuning of the 355M parameter RoBERTa-Large model (Liu et al., 2019) on two GLUE (Wang et al., 2018) datasets: QNLI (116k question-answer pairs) and CoLA (10.7k linguistic acceptability examples).
Hardware Specification No The paper does not provide specific hardware details. It only mentions fine-tuning a "355M parameter RoBERTa-Large model" and computational constraints without specifying GPU or CPU models, memory, or other hardware specifications.
Software Dependencies No The paper mentions using a "pre-trained model from the Hugging Face library" but does not specify the version of the library or any other software dependencies with version numbers.
Experiment Setup Yes We used linear warmup with warmup ratio being 0.1, and hyperparameters were β₁ = 0.9, β₂ = 0.999, b = ε·1, where 1 = (1, 1, ..., 1) ∈ ℝ^d. We tuned batchsize and stepsize γ for Adam and selected best values from {4, 8, 16, 32} for the batchsize and from {10^-6, 3·10^-6, 10^-5, 3·10^-5, 10^-4} for γ. For the CoLA dataset, the best batchsize was 16 and γ = 10^-5, and for the RTE dataset, the best batchsize was 8 and γ = 10^-5. We tested coordinate-wise clipping with λ ∈ {0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1} and layer-wise clipping with λ ∈ {0.1, 0.2, 0.5, 1, 2, 5, 10}.
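The setup row above mentions the two clipping modes searched over. A minimal sketch of what these two operations typically compute, assuming coordinate-wise clipping clamps each gradient entry to [-λ, λ] and layer-wise clipping rescales each layer's gradient to have L2 norm at most λ (function names are illustrative, not from the paper's code):

```python
import numpy as np

def clip_coordinatewise(grad, lam):
    """Clamp every coordinate of the gradient to the interval [-lam, lam]."""
    return np.clip(grad, -lam, lam)

def clip_layerwise(layer_grads, lam):
    """Rescale each layer's gradient so its L2 norm is at most lam."""
    clipped = []
    for g in layer_grads:
        norm = np.linalg.norm(g)
        # Shrink only when the norm exceeds the threshold; leave small gradients intact.
        scale = min(1.0, lam / norm) if norm > 0 else 1.0
        clipped.append(g * scale)
    return clipped
```

Either operation is applied to the stochastic gradient before the Adam/AdaGrad update; the λ grids quoted above are the thresholds the authors tuned for each mode.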