Tracking the Median of Gradients with a Stochastic Proximal Point Method
Authors: Fabian Schaipp, Guillaume Garrigos, Umut Şimşekli, Robert M. Gower
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We finally illustrate our theory on synthetic least-squares experiments where we compare the effectiveness of sample median, sample mean, and several online median estimators. Our results underline that for heavy-tailed noise, using the sample median is highly effective in contrast to the sample mean which is unstable and often does not converge. Our experiments also show that our online median estimates are robust, and require only a single sample per iteration, making it a less expensive alternative to the sample median. We further compare different clipping techniques for training transformer architectures on language modeling tasks, and show that they can improve upon the performance of SGD with momentum, however the gap is relatively small. The paper includes a dedicated section titled "Experiments" (Section 6) detailing these empirical evaluations. |
| Researcher Affiliation | Academia | Fabian Schaipp EMAIL Technical University of Munich and Inria Paris, ENS, PSL Research University; Guillaume Garrigos EMAIL Université Paris Cité and Sorbonne Université, CNRS Laboratoire de Probabilités, Statistique et Modélisation, F-75013 Paris, France; Umut Şimşekli EMAIL Inria, CNRS, ENS, PSL Research University Paris; Robert M. Gower EMAIL CCM, Flatiron Institute, Simons Foundation New York City. All listed institutions are universities or public research organizations. |
| Pseudocode | No | The paper describes algorithms and methods using mathematical notation and textual explanations (e.g., equations for wt+1 and mt+1 in Section 4, or derivations in Section 3.1). However, it does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks with structured, step-by-step instructions in a code-like format. |
| Open Source Code | No | In Section C, the authors state: "They also provide implementations for all tasks at https://github.com/fKunstner/noise-sgd-adam-sign, which we use." This refers to code from a separate, cited work (Kunstner et al. (2023)) that the authors used for their experiments, not source code developed and released by the authors for the methodology described in this paper. |
| Open Datasets | Yes | We consider the same three language modeling tasks as studied in (Kunstner et al., 2023): an encoder-only transformer for the PTB dataset, a Transformer-XL model for the WikiText-2 dataset, and fine-tuning a DistilBERT model for question-answering on the SQuAD dataset. |
| Dataset Splits | Yes | We consider the same three language modeling tasks as studied in (Kunstner et al., 2023): an encoder-only transformer for the PTB dataset, a Transformer-XL model for the WikiText-2 dataset, and fine-tuning a DistilBERT model for question-answering on the SQuAD dataset. We refer to Appendix C for details. In Section C, it states: For the language modeling experiments, all details are identical to Kunstner et al. (2023), Section A.1. |
| Hardware Specification | No | We thank the Scientific Computing Core at the Flatiron Institute, a division of the Simons Foundation, for the compute facilities and support. This statement acknowledges the use of computing resources but does not provide specific hardware details such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | They also provide implementations for all tasks at https://github.com/fKunstner/noise-sgd-adam-sign, which we use. While this mentions using an implementation, it does not specify any software dependencies (e.g., Python, PyTorch, CUDA) with their respective version numbers that would be needed for reproducibility. |
| Experiment Setup | Yes | We run all methods with a learning rate ηt = 0.01. For VClip, CClip, and Huber we set τ = 1, and we set again µ = 1.345 for Huber. We choose standard momentum/clipping parameters for all tasks (without tuning): we set β = 0.9 for SGD-M, τ = 0.1 for V/CClip, and β = 0.9, c = 1 for clipped-SGD. For all methods, we tune the learning rate on a log10-scaled grid (tuned values reported in Table 1). For the language modeling experiments... we use batch size 256 for PTB, 320 for Wiki Text-2, and 32 for SQuAD. |
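The Research Type row quotes the paper's claim that an online median estimate, updated from a single sample per iteration, stays robust under heavy-tailed noise where the sample mean is unstable. A minimal one-dimensional sketch of that idea is below: the update moves the estimate toward each new sample by at most a (decreasing) step size, which is the proximal-point step on the absolute loss |g − m|. This is an illustrative toy, not the paper's exact estimator, and the step-size schedule is an assumption.

```python
import numpy as np

def online_median(samples, eta0=0.5):
    """Toy one-sample-per-iteration median estimate.

    Each step is the proximal-point update on |g - m|: move toward
    the new sample g by at most eta_t (a decreasing step size).
    Hypothetical illustration; the paper's estimator may differ.
    """
    m = 0.0
    for t, g in enumerate(samples):
        eta_t = eta0 / np.sqrt(t + 1)          # decreasing step (assumed schedule)
        m += np.clip(g - m, -eta_t, eta_t)     # clipped move toward the sample
    return m

rng = np.random.default_rng(0)
data = rng.standard_cauchy(10_000)  # heavy-tailed noise; true median is 0

med_est = online_median(data)       # stays near 0
mean_est = data.mean()              # sample mean has no defined limit for Cauchy noise
```

With Cauchy samples the running mean can be thrown arbitrarily far by a single outlier, while the clipped update caps the influence of any one sample at the current step size, which is the robustness property the quoted experiments highlight.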
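The Experiment Setup row lists clipped-SGD with β = 0.9 and c = 1 among the baselines. As a rough sketch of what such a baseline typically looks like, the following implements one SGD-with-momentum step with global-norm gradient clipping at threshold c; the exact clipping variant and momentum formulation used in the paper may differ, so treat the details here as assumptions.

```python
import numpy as np

def clipped_sgd_step(w, grad, buf, lr=0.01, beta=0.9, c=1.0):
    """One step of SGD with momentum and global-norm gradient clipping.

    lr, beta, c mirror the values quoted in the setup row; the
    combination of clipping with momentum shown here is a common
    convention, not necessarily the paper's exact algorithm.
    """
    norm = np.linalg.norm(grad)
    if norm > c:
        grad = grad * (c / norm)   # rescale gradient to norm at most c
    buf = beta * buf + grad        # heavy-ball momentum buffer
    return w - lr * buf, buf

w = np.array([1.0, -2.0])
buf = np.zeros_like(w)
g = np.array([3.0, 4.0])           # norm 5 > c=1, so it gets rescaled to norm 1
w, buf = clipped_sgd_step(w, g, buf)
```

After clipping, the gradient becomes (0.6, 0.8); with an empty momentum buffer the parameters move by lr times that vector, so w ends at (0.994, −2.008).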