Nonasymptotic Analysis of Stochastic Gradient Descent with the Richardson–Romberg Extrapolation
Authors: Marina Sheshukova, Denis Belomestny, Alain Oliviero Durmus, Eric Moulines, Aleksei Naumov, Sergey Samsonov
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5 NUMERICAL RESULTS. In this section we illustrate numerically the scale of the second-order terms in equation (41) in Corollary 7. We show that, for a particular minimization problem, setting γ = n^{-1/2}, we achieve a scaling of the second-order terms in the root-MSE bounds of order O(n^{-3/4}). We consider the problem min_{θ∈ℝ} f(θ), f(θ) = θ² + cos θ, with the stochastic gradient oracle F(θ, ξ) = 2θ − sin θ + ξ, where ξ ~ N(0, 1). This example satisfies assumptions A1, A2, A3(p) for any p ≥ 2. We select different sample sizes n, choose γ = 1/√n, and construct the associated estimates θ_n^{(γ)} and θ_n^{(2γ)}. A detailed description of the experimental setting is provided in Appendix E. Then for each n we compute the Richardson–Romberg estimates θ_n^{(RR)} from (36), alongside their versions without the leading term in n, i.e. θ_n^{(RR)} + n^{-1} Σ_{k=n+1}^{2n} ε_{k+1}(θ*). We first plot ‖θ_n^{(RR)} − θ*‖² and ‖θ_n^{(RR)} + n^{-1} Σ_{k=n+1}^{2n} ε_{k+1}(θ*) − θ*‖², averaged over M = 320 parallel runs, in Figure 1. On the same figure we also plot the rescaled errors n‖θ_n^{(RR)} − θ*‖² and n^{3/2}‖θ_n^{(RR)} + n^{-1} Σ_{k=n+1}^{2n} ε_{k+1}(θ*) − θ*‖², also averaged over M parallel runs. The plot indicates that the proper scaling of the squared norm of the remainder part is n^{-3/2}, that is, the corresponding term in the root-MSE bound for E_ν^{1/2}[‖θ_n^{(RR)} − θ*‖²] scales as O(n^{-3/4}), as predicted by Corollary 7. Figure 1: Left: Richardson–Romberg experimental error with and without the leading term n^{-1} Σ_{k=n+1}^{2n} ε_{k+1}(θ*). Right: same errors after rescaling by n and n^{3/2}, respectively. |
| Researcher Affiliation | Academia | Marina Sheshukova¹,⁵, Denis Belomestny²,¹, Alain Durmus³, Eric Moulines³,⁴, Alexey Naumov¹,⁶, Sergey Samsonov¹. ¹HSE University; ²Duisburg-Essen University; ³CMAP, UMR 7641, École Polytechnique; ⁴Mohamed Bin Zayed University of AI; ⁵Skolkovo Institute of Science and Technology; ⁶Steklov Mathematical Institute of Russian Academy of Sciences |
| Pseudocode | No | The paper describes methods using mathematical equations and derivations, such as (11) and (36), but does not contain a structured pseudocode or algorithm block. |
| Open Source Code | Yes | Code to reproduce experiments is provided at https://github.com/svsamsonov/richardson_romberg_example. |
| Open Datasets | No | We consider the problem min_{θ∈ℝ} f(θ), f(θ) = θ² + cos θ, with the stochastic gradient oracle F(θ, ξ) = 2θ − sin θ + ξ, where ξ ~ N(0, 1). |
| Dataset Splits | No | We select different sample sizes n, choose γ = 1/√n, and construct the associated estimates θ_n^{(γ)} and θ_n^{(2γ)}. We conduct M = 320 independent parallel runs to approximate the expectations. |
| Hardware Specification | No | The paper does not provide any specific hardware details for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers. |
| Experiment Setup | Yes | We select different sample sizes n = 250·2^k, where k = 0, …, 14, and run the SGD procedure (2) with the constant step sizes γ and 2γ, selecting γ = 1/√n. Then for each n we compute the Richardson–Romberg estimates θ_n^{(RR)} from (36), alongside their versions without the leading term in n, i.e. θ_n^{(RR)} + n^{-1} Σ_{k=n+1}^{2n} ε_{k+1}(θ*). ... We conduct M = 320 independent parallel runs to approximate the expectations. |
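
The experiment quoted in the table can be sketched as follows. This is a minimal illustration, not the paper's released code: the tail-averaging window and the exact form of the estimator in the paper's eq. (36) are assumptions here, and a standard Richardson–Romberg combination 2·θ̄^{(γ)} − θ̄^{(2γ)} of averaged constant-step SGD runs is used in their place.

```python
import numpy as np

# Sketch of the paper's experiment (assumptions noted above):
# minimize f(θ) = θ² + cos θ via constant-step SGD with the noisy
# gradient oracle F(θ, ξ) = 2θ − sin θ + ξ, ξ ~ N(0, 1), then combine
# runs at step sizes γ and 2γ by Richardson–Romberg extrapolation.

def sgd_tail_average(gamma, n, rng, theta0=0.0):
    """Run 2n SGD steps with constant step size `gamma` and return the
    average of the last n iterates (the exact averaging window of the
    paper's eq. (2)/(36) is an assumption here)."""
    theta = theta0
    tail_sum = 0.0
    for k in range(2 * n):
        grad = 2.0 * theta - np.sin(theta) + rng.standard_normal()
        theta -= gamma * grad
        if k >= n:  # discard the first n iterates as burn-in
            tail_sum += theta
    return tail_sum / n

def richardson_romberg_estimate(n, rng):
    """RR-extrapolated estimate 2·θ̄^{(γ)} − θ̄^{(2γ)} with γ = 1/√n,
    which cancels the O(γ) bias of the constant-step averages."""
    gamma = 1.0 / np.sqrt(n)
    avg_gamma = sgd_tail_average(gamma, n, rng)
    avg_2gamma = sgd_tail_average(2.0 * gamma, n, rng)
    return 2.0 * avg_gamma - avg_2gamma
```

Since f′(θ) = 2θ − sin θ vanishes only at θ* = 0, averaging ‖θ_n^{(RR)} − θ*‖² over many independent runs (M = 320 in the paper) approximates the MSE whose O(n^{-3/4}) root-MSE behaviour Figure 1 of the paper illustrates.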