Methods for Convex $(L_0,L_1)$-Smooth Optimization: Clipping, Acceleration, and Adaptivity
Authors: Eduard Gorbunov, Nazarii Tupitsa, Sayantan Choudhury, Alen Aliev, Peter Richtárik, Samuel Horváth, Martin Takáč
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our numerical experiments, we focus on a simple 1-dimensional problem that is convex, $(L_0, L_1)$-smooth, and provides additional insights to the ones presented in the literature. In particular, we consider the function $f(x) = x^4$, which is convex and $(4, 3)$-smooth but not $L$-smooth, as illustrated in Example 1.1. We run (i) GD with stepsize $1/L$, $L = 12\lvert x_0\rvert^2$ (which corresponds to the worst-case smoothness constant on the interval $\lvert x\rvert \leq \lvert x_0\rvert$), (ii) $(L_0, L_1)$-GD with $L_0 = 4$, $L_1 = 3$, $\eta = \nu/2$, (iii) $(L_0, L_1)$-STM with $G_{k+1} = L_0 + L_1\\|\nabla f(x^{k+1})\\|$ (not supported by our theory), (iv) $(L_0, L_1)$-STM with $G_{k+1} = \max\{G_k, L_0 + L_1\\|\nabla f(x^{k+1})\\|\}$ (called $(L_0, L_1)$-STM-max on the plots), (v) GD-PS, and (vi) AdGD, for starting points $x_0 \in \{1, 10, 100\}$. The results are reported in Figure 1. In all tests, GD-PS and AdGD show the best results among the considered methods (which is expected, since they are the only parameter-free methods). Next, standard GD is the slowest of the methods and slows down further as the starting point moves away from the optimum, which is also expected, since $L$ increases and we have to use smaller stepsizes for GD. Finally, let us discuss the behavior of $(L_0, L_1)$-GD, $(L_0, L_1)$-STM-max, and $(L_0, L_1)$-STM. Clearly, it depends on the distance from the starting point to the solution. In particular, when $x_0 = 1$ we have $\\|\nabla f(x_0)\\| = 4$, meaning that $L = 16$. In this case, GD and $(L_0, L_1)$-GD behave similarly to each other, and $(L_0, L_1)$-STM-max significantly outperforms both of them, which is well aligned with the derived bounds. However, for $x_0 = 10$ and $x_0 = 100$ we have $\\|\nabla f(x_0)\\| = 4 \cdot 10^3$ and $\\|\nabla f(x_0)\\| = 4 \cdot 10^6$, leading to a significant slow-down in the convergence of GD and $(L_0, L_1)$-STM-max. In particular, $(L_0, L_1)$-GD achieves a similar optimization error to $(L_0, L_1)$-STM-max for $x_0 = 10$ and a much better optimization error for $x_0 = 100$. This is also aligned with our theoretical results: when $R_0$ is large and the number of iterations is not too large, bound (15) derived for $(L_0, L_1)$-GD can be better than bound (23) derived for $(L_0, L_1)$-STM-max. Moreover, for $x_0 = 100$, Figure 1 illustrates well the two-stage convergence behavior of $(L_0, L_1)$-GD described in Theorem 3.1. Finally, although our theory does not provide any guarantees for $(L_0, L_1)$-STM with $G_{k+1} = L_0 + L_1\\|\nabla f(x^{k+1})\\|$, this method converges faster than $(L_0, L_1)$-GD for the considered problem but exhibits highly non-monotone behavior. |
| Researcher Affiliation | Academia | Eduard Gorbunov (MBZUAI), Nazarii Tupitsa (MBZUAI, Innopolis University), Sayantan Choudhury (Johns Hopkins University), Alen Aliev (MBZUAI), Peter Richtárik (KAUST), Samuel Horváth (MBZUAI), Martin Takáč (MBZUAI) |
| Pseudocode | Yes | Algorithm 1: $(L_0, L_1)$-Gradient Descent ($(L_0, L_1)$-GD). Input: starting point $x^0$, number of iterations $N$, stepsize parameter $\eta > 0$, $L_0 > 0$, $L_1 \geq 0$. 1: for $k = 0, 1, \ldots, N-1$ do; 2: $x^{k+1} = x^k - \frac{\eta}{L_0 + L_1\\|\nabla f(x^k)\\|}\nabla f(x^k)$; 3: end for. Output: $x^N$ |
| Open Source Code | No | The paper mentions the use of PEPit (Goujaud et al., 2024), which is a third-party tool, but does not provide an explicit statement or link for the authors' own implementation code. |
| Open Datasets | Yes | We also run the considered methods for real datasets from LIBSVM (Chang & Lin, 2011) a9a and mushrooms for different starting points. |
| Dataset Splits | No | The paper mentions using 'real datasets from LIBSVM (Chang & Lin, 2011) a9a and mushrooms' but does not provide specific details on how these datasets were split into training, validation, or test sets. |
| Hardware Specification | No | The paper describes numerical experiments on synthetic and real datasets but does not specify any particular hardware used (e.g., GPU models, CPU types, or memory). |
| Software Dependencies | No | The paper mentions 'PEPit (Goujaud et al., 2024)' as a tool used for analysis ('our preliminary computer-aided analysis using PEPit'), but it does not specify any version numbers for this software or any other software dependencies, which is required for reproducibility. |
| Experiment Setup | Yes | We run (i) GD with stepsize $1/L$, $L = 12\lvert x_0\rvert^2$ (which corresponds to the worst-case smoothness constant on the interval $\lvert x\rvert \leq \lvert x_0\rvert$), (ii) $(L_0, L_1)$-GD with $L_0 = 4$, $L_1 = 3$, $\eta = \nu/2$, (iii) $(L_0, L_1)$-STM with $G_{k+1} = L_0 + L_1\\|\nabla f(x^{k+1})\\|$ (not supported by our theory), (iv) $(L_0, L_1)$-STM with $G_{k+1} = \max\{G_k, L_0 + L_1\\|\nabla f(x^{k+1})\\|\}$ (called $(L_0, L_1)$-STM-max on the plots), (v) GD-PS, and (vi) AdGD, for starting points $x_0 \in \{1, 10, 100\}$. |
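The $(L_0, L_1)$-GD update quoted in the Pseudocode row above is simple enough to sketch directly. The snippet below is a minimal 1-D version applied to the paper's test problem $f(x) = x^4$ with $(L_0, L_1) = (4, 3)$; the stepsize `eta = 1.0` is an illustrative assumption, since the paper's choice $\eta = \nu/2$ depends on a constant $\nu$ not given in the excerpts.

```python
def l0l1_gd(grad, x0, num_iters, eta, L0, L1):
    """(L0, L1)-GD (Algorithm 1 of the paper), 1-D case:
    x_{k+1} = x_k - eta / (L0 + L1 * |f'(x_k)|) * f'(x_k)."""
    x = x0
    for _ in range(num_iters):
        g = grad(x)
        x = x - eta / (L0 + L1 * abs(g)) * g
    return x

# The paper's test problem: f(x) = x^4 is convex and (4, 3)-smooth,
# but not L-smooth (its second derivative 12x^2 is unbounded).
f = lambda x: x**4
grad_f = lambda x: 4 * x**3

# eta = 1.0 is an assumed stepsize for illustration only.
x_final = l0l1_gd(grad_f, x0=10.0, num_iters=2000, eta=1.0, L0=4.0, L1=3.0)
print(x_final, f(x_final))
```

Note how the effective stepsize $\eta/(L_0 + L_1\|\nabla f(x^k)\|)$ is small far from the optimum, where the gradient is huge, and approaches $\eta/L_0$ near it; this clipping-like normalization is what lets the method avoid the tiny worst-case stepsize $1/L = 1/(12\lvert x_0\rvert^2)$ that plain GD must use.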
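As a quick sanity check of the quoted claim that $f(x) = x^4$ is $(4, 3)$-smooth but not $L$-smooth, one can test the inequality $\lvert f''(x)\rvert \leq L_0 + L_1 \lvert f'(x)\rvert$ on a grid. (This Hessian-based form of $(L_0, L_1)$-smoothness is an assumption here; the paper's exact definition is not reproduced in the excerpts.)

```python
# Check |f''(x)| <= L0 + L1 * |f'(x)| for f(x) = x^4, i.e.
# 12 x^2 <= 4 + 3 * |4 x^3|, on a grid over [-100, 100].
L0, L1 = 4.0, 3.0
xs = [i / 100.0 for i in range(-10000, 10001)]
violations = [x for x in xs if 12 * x**2 > L0 + L1 * abs(4 * x**3)]
print(len(violations))  # 0: the (4, 3)-smoothness inequality holds on the grid

# By contrast, |f''(x)| = 12 x^2 is unbounded, so no single L works globally.
max_hessian = max(12 * x**2 for x in xs)
print(max_hessian)  # 120000.0 at the grid endpoints x = +/-100
```

This matches the stepsize discussion above: on $\lvert x\rvert \leq \lvert x_0\rvert$ the worst-case smoothness constant $12\lvert x_0\rvert^2$ grows quadratically with the starting point, while $L_0 + L_1\lvert f'(x)\rvert$ tracks the local curvature.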