Tracking the Median of Gradients with a Stochastic Proximal Point Method
Authors: Fabian Schaipp, Guillaume Garrigos, Umut Şimşekli, Robert M. Gower
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We finally illustrate our theory on synthetic least-squares experiments where we compare the effectiveness of sample median, sample mean, and several online median estimators. Our results underline that for heavy-tailed noise, using the sample median is highly effective in contrast to the sample mean which is unstable and often does not converge. Our experiments also show that our online median estimates are robust, and require only a single sample per iteration, making it a less expensive alternative to the sample median. We further compare different clipping techniques for training transformer architectures on language modeling tasks, and show that they can improve upon the performance of SGD with momentum, however the gap is relatively small. The paper includes a dedicated section titled "Experiments" (Section 6) detailing these empirical evaluations. |
| Researcher Affiliation | Academia | Fabian Schaipp EMAIL Technical University of Munich and Inria Paris, ENS, PSL Research University; Guillaume Garrigos EMAIL Université Paris Cité and Sorbonne Université, CNRS Laboratoire de Probabilités, Statistique et Modélisation, F-75013 Paris, France; Umut Şimşekli EMAIL Inria, CNRS, ENS, PSL Research University Paris; Robert M. Gower EMAIL CCM, Flatiron Institute, Simons Foundation New York City. All listed institutions are universities or public research organizations. |
| Pseudocode | No | The paper describes algorithms and methods using mathematical notation and textual explanations (e.g., equations for wt+1 and mt+1 in Section 4, or derivations in Section 3.1). However, it does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks with structured, step-by-step instructions in a code-like format. |
| Open Source Code | No | In Section C, the authors state: "They also provide implementations for all tasks at https://github.com/fKunstner/noise-sgd-adam-sign, which we use." This refers to code from a separate, cited work (Kunstner et al. (2023)) that the authors used for their experiments, not source code developed and released by the authors for the methodology described in this paper. |
| Open Datasets | Yes | We consider the same three language modeling tasks as studied in (Kunstner et al., 2023): an encoder-only transformer for the PTB dataset, a Transformer-XL model for the WikiText-2 dataset, and fine-tuning a DistilBERT model for question-answering on the SQuAD dataset. |
| Dataset Splits | Yes | We consider the same three language modeling tasks as studied in (Kunstner et al., 2023): an encoder-only transformer for the PTB dataset, a Transformer-XL model for the WikiText-2 dataset, and fine-tuning a DistilBERT model for question-answering on the SQuAD dataset. We refer to Appendix C for details. In Section C, it states: For the language modeling experiments, all details are identical to Kunstner et al. (2023), Section A.1. |
| Hardware Specification | No | We thank the Scientific Computing Core at the Flatiron Institute, a division of the Simons Foundation, for the compute facilities and support. This statement acknowledges the use of computing resources but does not provide specific hardware details such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | They also provide implementations for all tasks at https://github.com/fKunstner/noise-sgd-adam-sign, which we use. While this mentions using an implementation, it does not specify any software dependencies (e.g., Python, PyTorch, CUDA) with their respective version numbers that would be needed for reproducibility. |
| Experiment Setup | Yes | We run all methods with a learning rate ηt = 0.01. For VClip, CClip, and Huber we set τ = 1, and we set again µ = 1.345 for Huber. We choose standard momentum/clipping parameters for all tasks (without tuning): we set β = 0.9 for SGD-M, τ = 0.1 for V/CClip, and β = 0.9, c = 1 for clipped-SGD. For all methods, we tune the learning rate on a log10-scaled grid (tuned values reported in Table 1). For the language modeling experiments... we use batch size 256 for PTB, 320 for Wiki Text-2, and 32 for SQuAD. |
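The Research Type row quotes the paper's claim that an online median estimate, updated from a single sample per iteration, stays robust under heavy-tailed noise where the sample mean is unstable. A minimal one-dimensional sketch of that idea is below: the update moves the estimate toward each new sample by at most a (decreasing) step size, which is the proximal-point step on the absolute loss |g − m|. This is an illustrative toy, not the paper's exact estimator, and the step-size schedule is an assumption.

```python
import numpy as np

def online_median(samples, eta0=0.5):
    """Toy one-sample-per-iteration median estimate.

    Each step is the proximal-point update on |g - m|: move toward
    the new sample g by at most eta_t (a decreasing step size).
    Hypothetical illustration; the paper's estimator may differ.
    """
    m = 0.0
    for t, g in enumerate(samples):
        eta_t = eta0 / np.sqrt(t + 1)          # decreasing step (assumed schedule)
        m += np.clip(g - m, -eta_t, eta_t)     # clipped move toward the sample
    return m

rng = np.random.default_rng(0)
data = rng.standard_cauchy(10_000)  # heavy-tailed noise; true median is 0

med_est = online_median(data)       # stays near 0
mean_est = data.mean()              # sample mean has no defined limit for Cauchy noise
```

With Cauchy samples the running mean can be thrown arbitrarily far by a single outlier, while the clipped update caps the influence of any one sample at the current step size, which is the robustness property the quoted experiments highlight.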
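The Experiment Setup row lists clipped-SGD with β = 0.9 and c = 1 among the baselines. As a rough sketch of what such a baseline typically looks like, the following implements one SGD-with-momentum step with global-norm gradient clipping at threshold c; the exact clipping variant and momentum formulation used in the paper may differ, so treat the details here as assumptions.

```python
import numpy as np

def clipped_sgd_step(w, grad, buf, lr=0.01, beta=0.9, c=1.0):
    """One step of SGD with momentum and global-norm gradient clipping.

    lr, beta, c mirror the values quoted in the setup row; the
    combination of clipping with momentum shown here is a common
    convention, not necessarily the paper's exact algorithm.
    """
    norm = np.linalg.norm(grad)
    if norm > c:
        grad = grad * (c / norm)   # rescale gradient to norm at most c
    buf = beta * buf + grad        # heavy-ball momentum buffer
    return w - lr * buf, buf

w = np.array([1.0, -2.0])
buf = np.zeros_like(w)
g = np.array([3.0, 4.0])           # norm 5 > c=1, so it gets rescaled to norm 1
w, buf = clipped_sgd_step(w, g, buf)
```

After clipping, the gradient becomes (0.6, 0.8); with an empty momentum buffer the parameters move by lr times that vector, so w ends at (0.994, −2.008).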