Nonasymptotic Analysis of Stochastic Gradient Descent with the Richardson–Romberg Extrapolation

Authors: Marina Sheshukova, Denis Belomestny, Alain Oliviero Durmus, Eric Moulines, Aleksei Naumov, Sergey Samsonov

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 5 NUMERICAL RESULTS: In this section we illustrate numerically the scale of the second-order terms in equation (41) in Corollary 7. We show that, for a particular minimization problem, setting γ = n^{-1/2}, we achieve a scaling of the second-order terms in the root-MSE bound of order O(n^{-3/4}). We consider the problem min_{θ∈ℝ} f(θ), f(θ) = θ² + cos θ, with the stochastic gradient oracle F(θ, ξ) = 2θ − sin θ + ξ, where ξ ~ N(0, 1). This example satisfies assumptions A1, A2, A3(p) for any p ≥ 2. We select different sample sizes n, choose γ = 1/√n, and construct the associated estimates θ̄_n^{(γ)} and θ̄_n^{(2γ)}. A detailed description of the experimental setting is provided in Appendix E. Then for each n we compute the Richardson–Romberg estimates θ_n^{RR} from (36), together with their versions without the leading term in n, i.e. θ_n^{RR} + n^{-1} Σ_{k=n+1}^{2n} ε_{k+1}(θ⋆). We first plot ‖θ_n^{RR} − θ⋆‖² and ‖θ_n^{RR} + n^{-1} Σ_{k=n+1}^{2n} ε_{k+1}(θ⋆) − θ⋆‖², averaged over M = 320 parallel runs, in Figure 1. On the same figure we also plot the rescaled errors n‖θ_n^{RR} − θ⋆‖² and n^{3/2}‖θ_n^{RR} − θ⋆ + n^{-1} Σ_{k=n+1}^{2n} ε_{k+1}(θ⋆)‖², also averaged over M parallel runs. The plot indicates that the proper scaling of the squared norm of the remainder part is n^{-3/2}, that is, the corresponding term in the root-MSE bound for E_ν^{1/2}[‖θ_n^{RR} − θ⋆‖²] scales as O(n^{-3/4}), as predicted by Corollary 7. Figure 1: Left: Richardson–Romberg experimental error with and without the leading term n^{-1} Σ_{k=n+1}^{2n} ε_{k+1}(θ⋆). Right: the same errors after rescaling by n and n^{3/2}, respectively.
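The quoted experiment is simple to reproduce in spirit. Below is a minimal NumPy sketch, not the authors' code: it assumes the Richardson–Romberg estimate combines tail averages of two constant-step SGD runs as 2θ̄_n^{(γ)} − θ̄_n^{(2γ)} (the standard form suggested by the paper's eq. (36)), with each average taken over iterates n+1, …, 2n to match the sum Σ_{k=n+1}^{2n} in the quote. Function names are illustrative. For f(θ) = θ² + cos θ the minimizer is θ⋆ = 0.

```python
import numpy as np

def sgd_tail_average(step, n, rng, theta0=0.0):
    """Run 2n steps of constant-step SGD on f(theta) = theta^2 + cos(theta)
    with the noisy gradient F(theta, xi) = 2*theta - sin(theta) + xi,
    xi ~ N(0, 1); return the average over the last n iterates."""
    theta = theta0
    acc = 0.0
    for k in range(2 * n):
        xi = rng.standard_normal()
        theta -= step * (2.0 * theta - np.sin(theta) + xi)
        if k >= n:            # tail-average iterates n+1 .. 2n
            acc += theta
    return acc / n

def rr_estimate(n, rng):
    """Richardson-Romberg extrapolation with gamma = 1/sqrt(n):
    combine the averages at steps gamma and 2*gamma (assumed form of eq. (36))."""
    gamma = 1.0 / np.sqrt(n)
    return 2.0 * sgd_tail_average(gamma, n, rng) - sgd_tail_average(2.0 * gamma, n, rng)

rng = np.random.default_rng(0)
est = rr_estimate(4000, rng)   # should be close to theta* = 0
```

The leading error term is the noise average n^{-1} Σ ε_{k+1}(θ⋆), so the estimate fluctuates around θ⋆ = 0 at scale roughly n^{-1/2}.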
Researcher Affiliation | Academia | Marina Sheshukova^{1,5}, Denis Belomestny^{2,1}, Alain Durmus^3, Eric Moulines^{3,4}, Alexey Naumov^{1,6}, Sergey Samsonov^1. ^1 HSE University; ^2 Duisburg-Essen University; ^3 CMAP, UMR 7641, École Polytechnique; ^4 Mohamed Bin Zayed University of AI; ^5 Skolkovo Institute of Science and Technology; ^6 Steklov Mathematical Institute of Russian Academy of Sciences.
Pseudocode | No | The paper describes its methods via mathematical equations and derivations, such as (11) and (36), but does not contain a structured pseudocode or algorithm block.
Open Source Code | Yes | Code to reproduce the experiments is provided at https://github.com/svsamsonov/richardson romberg example.
Open Datasets | No | The experiments use only synthetic data: the problem min_{θ∈ℝ} f(θ), f(θ) = θ² + cos θ, with the stochastic gradient oracle F(θ, ξ) = 2θ − sin θ + ξ, where ξ ~ N(0, 1).
Dataset Splits | No | We select different sample sizes n, choose γ = 1/√n, and construct the associated estimates θ̄_n^{(γ)} and θ̄_n^{(2γ)}. We conduct M = 320 independent parallel runs to approximate the expectations.
Hardware Specification | No | The paper does not provide any specific hardware details for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers.
Experiment Setup | Yes | We select different sample sizes n = 250·2^k, where k = 0, …, 14, and run the SGD procedure (2) with constant step sizes γ and 2γ, selecting γ = 1/√n. Then for each n we compute the Richardson–Romberg estimates θ_n^{RR} from (36), together with their versions without the leading term in n, i.e. θ_n^{RR} + n^{-1} Σ_{k=n+1}^{2n} ε_{k+1}(θ⋆). … We conduct M = 320 independent parallel runs to approximate the expectations.
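The sweep over sample sizes can be sketched as follows. This is a self-contained, scaled-down illustration (a few values of k and fewer runs than the paper's M = 320), assuming the same RR form 2θ̄_n^{(γ)} − θ̄_n^{(2γ)} with tail averages over iterates n+1, …, 2n; chains are vectorized across the M runs for speed. Since the leading error term has variance 1/n, the empirical root-MSE should shrink roughly like n^{-1/2}.

```python
import numpy as np

def rr_mse(n, M, rng):
    """Mean squared error over M parallel runs of the Richardson-Romberg
    estimate 2*avg_gamma - avg_2gamma, gamma = 1/sqrt(n), on
    f(theta) = theta^2 + cos(theta) with minimizer theta* = 0."""
    gamma = 1.0 / np.sqrt(n)
    avgs = []
    for step in (gamma, 2.0 * gamma):
        theta = np.zeros(M)       # M chains run in parallel
        acc = np.zeros(M)
        for k in range(2 * n):
            xi = rng.standard_normal(M)
            theta -= step * (2.0 * theta - np.sin(theta) + xi)
            if k >= n:            # tail-average iterates n+1 .. 2n
                acc += theta
        avgs.append(acc / n)
    rr = 2.0 * avgs[0] - avgs[1]
    return float(np.mean(rr ** 2))

rng = np.random.default_rng(1)
rmse = {}
for k in range(4):                # the paper uses k = 0, ..., 14 and M = 320
    n = 250 * 2 ** k
    rmse[n] = np.sqrt(rr_mse(n, M=100, rng=rng))
    print(n, rmse[n])             # root-MSE shrinks roughly like n^{-1/2}
```

To reproduce the paper's Figure 1 one would additionally track the error with the leading noise average subtracted and rescale the two squared errors by n and n^{3/2}, respectively.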