Clipping Improves Adam-Norm and AdaGrad-Norm when the Noise Is Heavy-Tailed

Authors: Savelii Chezhegov, Yaroslav Klyukin, Andrei Semenov, Aleksandr Beznosikov, Alexander Gasnikov, Samuel Horváth, Martin Takáč, Eduard Gorbunov

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our empirical evaluations highlight the superiority of clipped versions of AdaGrad/Adam in handling the heavy-tailed noise. ... Numerical experiments. We conducted numerical experiments for synthetic and real-world problems. More precisely, we illustrate the superiority of different versions of Adam/AdaGrad with clipping to the non-clipped versions of Adam/AdaGrad on a simple quadratic problem with additive heavy-tailed noise in the gradients. Next, we also test Adam with and without clipping on the fine-tuning of ALBERT Base model (Lan et al., 2019) on CoLA and RTE datasets (Wang et al., 2018) and observe that Adam with clipping significantly outperforms Adam without clipping when the noise is heavy-tailed. We also obtain similar results for the fine-tuning of RoBERTa-Large model (Liu et al., 2019).
Researcher Affiliation Collaboration 1Moscow Institute of Physics and Technology, Russia 2Ivannikov Institute for System Programming RAS, Russia 3Sber AI Lab, Russia 4Machine Learning and Optimization Laboratory (MLO), EPFL, Lausanne, Switzerland 5Innopolis University, Russia 6The Russian Presidential Academy of National Economy and Public Administration, Russia 7Skolkovo Institute of Science and Technology, Russia 8Mohamed bin Zayed University of Artificial Intelligence, UAE. Correspondence to: Eduard Gorbunov <EMAIL>.
Pseudocode Yes Algorithm 1 Adam-norm/AdamD-norm and M-AdaGrad-norm/M-AdaGradD-norm; Algorithm 2 Clip-Adam-norm/Clip-AdamD-norm and Clip-M-AdaGrad-norm/Clip-M-AdaGradD-norm
Open Source Code Yes Our code is available online: https://github.com/yaroslavkliukin/Clipped-Ada-Grad-and-Adam.
Open Datasets Yes Next, we also test Adam with and without clipping on the fine-tuning of ALBERT Base model (Lan et al., 2019) on CoLA and RTE datasets (Wang et al., 2018) and observe that Adam with clipping significantly outperforms Adam without clipping when the noise is heavy-tailed. ... In Appendix D.3, we also provide additional experiments with the fine-tuning of the 355M parameter RoBERTa-Large model (Liu et al., 2019) on two GLUE (Wang et al., 2018) datasets: QNLI (116k question-answer pairs) and CoLA (10.7k linguistic acceptability examples).
Dataset Splits Yes Next, we also test Adam with and without clipping on the fine-tuning of ALBERT Base model (Lan et al., 2019) on CoLA and RTE datasets (Wang et al., 2018) and observe that Adam with clipping significantly outperforms Adam without clipping when the noise is heavy-tailed. ... In Appendix D.3, we also provide additional experiments with the fine-tuning of the 355M parameter RoBERTa-Large model (Liu et al., 2019) on two GLUE (Wang et al., 2018) datasets: QNLI (116k question-answer pairs) and CoLA (10.7k linguistic acceptability examples).
Hardware Specification No The paper does not provide specific hardware details. It only mentions fine-tuning a "355M parameter RoBERTa-Large model" and computational constraints without specifying GPU or CPU models, memory, or other hardware specifications.
Software Dependencies No The paper mentions using a "pre-trained model from the Hugging Face library" but does not specify the version of the library or any other software dependencies with version numbers.
Experiment Setup Yes We used linear warmup with warmup ratio being 0.1, and hyperparameters were β₁ = 0.9, β₂ = 0.999, b = ε·1, where 1 = (1, 1, ..., 1) ∈ ℝ^d. We tuned batchsize and stepsize γ for Adam and selected best values from {4, 8, 16, 32} for the batchsize and from {10^-6, 3·10^-6, 10^-5, 3·10^-5, 10^-4} for γ. For the CoLA dataset, the best batchsize was 16 and γ = 10^-5, and for the RTE dataset, the best batchsize was 8 and γ = 10^-5. We tested coordinate-wise clipping with λ ∈ {0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1} and layer-wise clipping with λ ∈ {0.1, 0.2, 0.5, 1, 2, 5, 10}.
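The setup row above mentions the two clipping modes searched over. A minimal sketch of what these two operations typically compute, assuming coordinate-wise clipping clamps each gradient entry to [-λ, λ] and layer-wise clipping rescales each layer's gradient to have L2 norm at most λ (function names are illustrative, not from the paper's code):

```python
import numpy as np

def clip_coordinatewise(grad, lam):
    """Clamp every coordinate of the gradient to the interval [-lam, lam]."""
    return np.clip(grad, -lam, lam)

def clip_layerwise(layer_grads, lam):
    """Rescale each layer's gradient so its L2 norm is at most lam."""
    clipped = []
    for g in layer_grads:
        norm = np.linalg.norm(g)
        # Shrink only when the norm exceeds the threshold; leave small gradients intact.
        scale = min(1.0, lam / norm) if norm > 0 else 1.0
        clipped.append(g * scale)
    return clipped
```

Either operation is applied to the stochastic gradient before the Adam/AdaGrad update; the λ grids quoted above are the thresholds the authors tuned for each mode.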