Methods for Convex $(L_0,L_1)$-Smooth Optimization: Clipping, Acceleration, and Adaptivity

Authors: Eduard Gorbunov, Nazarii Tupitsa, Sayantan Choudhury, Alen Aliev, Peter Richtarik, Samuel Horváth, Martin Takáč

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our numerical experiments, we focus on a simple one-dimensional problem that is convex, $(L_0,L_1)$-smooth, and provides additional insights to those presented in the literature. In particular, we consider the function $f(x) = x^4$, which is convex and $(4,3)$-smooth but not $L$-smooth, as illustrated in Example 1.1. We run (i) GD with stepsize $1/L$, $L = 12|x_0|^2$ (the worst-case smoothness constant on the interval $|x| \le |x_0|$); (ii) $(L_0,L_1)$-GD with $L_0 = 4$, $L_1 = 3$, $\eta = \nu/2$; (iii) $(L_0,L_1)$-STM with $G_{k+1} = L_0 + L_1\|\nabla f(x_{k+1})\|$ (not supported by our theory); (iv) the variant with $G_{k+1} = \max\{G_k, L_0 + L_1\|\nabla f(x_{k+1})\|\}$ (called $(L_0,L_1)$-STM-max on the plots); (v) GD-PS; and (vi) AdGD, for starting points $x_0 \in \{1, 10, 100\}$. The results are reported in Figure 1. In all tests, GD-PS and AdGD show the best results among the methods, which is expected since they are the only parameter-free methods. Next, standard GD is the slowest and slows down further as the starting point moves away from the optimum, which is also expected since $L$ increases and we have to use smaller stepsizes for GD. Finally, let us discuss the behavior of $(L_0,L_1)$-GD, $(L_0,L_1)$-STM-max, and $(L_0,L_1)$-STM. Clearly, it depends on the distance from the starting point to the solution. In particular, when $x_0 = 1$ we have $\|\nabla f(x_0)\| = 4$, meaning that $L = 16$. In this case, GD and $(L_0,L_1)$-GD behave similarly to each other, and $(L_0,L_1)$-STM-max significantly outperforms both of them, which is well aligned with the derived bounds. However, for $x_0 = 10$ and $x_0 = 100$ we have $\|\nabla f(x_0)\| = 4 \cdot 10^3$ and $\|\nabla f(x_0)\| = 4 \cdot 10^6$, leading to a significant slowdown in the convergence of GD and $(L_0,L_1)$-STM-max. In particular, $(L_0,L_1)$-GD achieves a similar optimization error to $(L_0,L_1)$-STM-max for $x_0 = 10$ and a much better optimization error for $x_0 = 100$.
This is also aligned with our theoretical results: when $R_0$ is large and the number of iterations is not too large, bound (15) derived for $(L_0,L_1)$-GD can be better than bound (23) derived for $(L_0,L_1)$-STM-max. Moreover, for $x_0 = 100$, Figure 1 illustrates well the two-stage convergence behavior of $(L_0,L_1)$-GD described in Theorem 3.1. Finally, although our theory does not provide any guarantees for $(L_0,L_1)$-STM with $G_{k+1} = L_0 + L_1\|\nabla f(x_{k+1})\|$, this method converges faster than $(L_0,L_1)$-GD for the considered problem but exhibits highly non-monotone behavior.
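The quoted constants can be checked directly. A minimal stdlib-only sketch (using $f(x) = x^4$, so $f'(x) = 4x^3$, with $L_0 = 4$, $L_1 = 3$ as in the excerpt) computes the effective smoothness constant $L_0 + L_1\|\nabla f(x_0)\|$ at each starting point:

```python
# Effective smoothness constant L = L0 + L1*|f'(x0)| at the three
# starting points from the excerpt, for f(x) = x**4 (f'(x) = 4*x**3)
# with L0 = 4, L1 = 3.
L0, L1 = 4.0, 3.0

def effective_L(x0):
    grad = 4.0 * x0 ** 3          # f'(x0)
    return L0 + L1 * abs(grad)

for x0 in (1.0, 10.0, 100.0):
    print(x0, effective_L(x0))
# x0 = 1:   L = 4 + 3*4    = 16        (the "L = 16" case in the text)
# x0 = 10:  L = 4 + 3*4e3  = 12004
# x0 = 100: L = 4 + 3*4e6  = 12000004
```

The explosion of this constant with $x_0$ is exactly why plain GD with a fixed stepsize $1/L$ slows down for distant starting points, while the clipped stepsize $\eta/(L_0 + L_1\|\nabla f(x_k)\|)$ grows back toward $\eta/L_0$ as the iterates approach the optimum.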
Researcher Affiliation | Academia | Eduard Gorbunov (MBZUAI), Nazarii Tupitsa (MBZUAI, Innopolis University), Sayantan Choudhury (Johns Hopkins University), Alen Aliev (MBZUAI), Peter Richtárik (KAUST), Samuel Horváth (MBZUAI), Martin Takáč (MBZUAI)
Pseudocode | Yes | Algorithm 1: $(L_0,L_1)$-Gradient Descent ($(L_0,L_1)$-GD)
Input: starting point $x_0$, number of iterations $N$, stepsize parameter $\eta > 0$, $L_0 > 0$, $L_1 \ge 0$
1: for $k = 0, 1, \ldots, N-1$ do
2:   $x_{k+1} = x_k - \frac{\eta}{L_0 + L_1\|\nabla f(x_k)\|}\nabla f(x_k)$
3: end for
Output: $x_N$
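The pseudocode translates line-for-line into a short Python sketch. Since the authors' code is not released, everything below is an illustrative transcription: the gradient oracle, the choice $\eta = 0.5$ (the excerpt uses $\eta = \nu/2$ with $\nu$ not given here), and the iteration count are assumptions.

```python
def l0l1_gd(grad, x0, n_iters, eta, L0, L1):
    """Direct transcription of Algorithm 1, (L0, L1)-GD:
    x_{k+1} = x_k - eta / (L0 + L1 * |grad f(x_k)|) * grad f(x_k)."""
    x = x0
    for _ in range(n_iters):
        g = grad(x)
        x = x - eta / (L0 + L1 * abs(g)) * g   # line 2 of Algorithm 1
    return x

# Illustrative run on the paper's toy objective f(x) = x**4:
x_final = l0l1_gd(grad=lambda x: 4.0 * x ** 3, x0=100.0, n_iters=2000,
                  eta=0.5, L0=4.0, L1=3.0)
```

Even though the initial gradient is huge ($4 \cdot 10^6$ at $x_0 = 100$), the clipped stepsize keeps the update bounded by roughly $\eta/L_1$ per iteration, so the iterate decreases monotonically toward the minimizer $0$.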
Open Source Code | No | The paper mentions the use of PEPit (Goujaud et al., 2024), which is a third-party tool, but does not provide an explicit statement or link for the authors' own implementation code.
Open Datasets | Yes | We also run the considered methods on real datasets from LIBSVM (Chang & Lin, 2011), a9a and mushrooms, for different starting points.
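For reference, a9a and mushrooms are distributed in the sparse LIBSVM (svmlight) text format, one `<label> <index>:<value> ...` example per line. A minimal stdlib-only reader for that format is sketched below; the sample line is illustrative, not an actual row of a9a.

```python
def read_libsvm_line(line):
    """Parse one line of the sparse LIBSVM/svmlight format used by
    datasets such as a9a and mushrooms: '<label> <index>:<value> ...'.
    Returns (label, {index: value}) with 1-based feature indices."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        idx, val = item.split(":")
        features[int(idx)] = float(val)
    return label, features

# Illustrative line in the same format (not a real a9a example):
label, feats = read_libsvm_line("-1 3:1 11:1 14:1")
# label == -1.0, feats == {3: 1.0, 11: 1.0, 14: 1.0}
```

Binary-feature datasets like these store only the indices of nonzero features, which is why each entry carries the value `1`.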
Dataset Splits | No | The paper mentions using 'real datasets from LIBSVM (Chang & Lin, 2011), a9a and mushrooms' but does not provide specific details on how these datasets were split into training, validation, or test sets.
Hardware Specification | No | The paper describes numerical experiments on synthetic and real datasets but does not specify any particular hardware used (e.g., GPU models, CPU types, or memory).
Software Dependencies | No | The paper mentions 'PEPit (Goujaud et al., 2024)' as a tool used for analysis ('our preliminary computer-aided analysis using PEPit'), but it does not specify version numbers for this software or any other software dependencies, which is required for reproducibility.
Experiment Setup | Yes | We run (i) GD with stepsize $1/L$, $L = 12|x_0|^2$ (the worst-case smoothness constant on the interval $|x| \le |x_0|$); (ii) $(L_0,L_1)$-GD with $L_0 = 4$, $L_1 = 3$, $\eta = \nu/2$; (iii) $(L_0,L_1)$-STM with $G_{k+1} = L_0 + L_1\|\nabla f(x_{k+1})\|$ (not supported by our theory); (iv) the variant with $G_{k+1} = \max\{G_k, L_0 + L_1\|\nabla f(x_{k+1})\|\}$ (called $(L_0,L_1)$-STM-max on the plots); (v) GD-PS; and (vi) AdGD, for starting points $x_0 \in \{1, 10, 100\}$.