Optimizing $(L_0, L_1)$-Smooth Functions by Gradient Methods
Authors: Daniil Vankov, Anton Rodomanov, Angelia Nedić, Lalitha Sankar, Sebastian Stich
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | G NUMERICAL RESULTS: In Fig. 1, we compare the performance of the analyzed methods for solving optimization problem (1) with the function $f(x) = \frac{1}{p}\|x\|^p$. We fix $L_1 = 1$ and choose $L_0 = (\frac{p-2}{L_1})^{p-2}$ according to Example A.1. For GM, we choose stepsizes according to (9), (12) and (13). For NGM, we use time-varying coefficients $\beta_k = \frac{\hat{R}}{k+1}$ with different values of $\hat{R} \in \{\frac{1}{2}R, 2R, 10R\}$, which allows us to study the robustness of this method to our initial guess of the unknown initial distance to the solution. Note that, for this particular problem, the choice $\hat{R} = R$ is rather special and allows the method to find the exact solution after one iteration, so we do not consider it. We observe that NGM and PS-GM outperform GM with stepsizes from (9), (12) and (13). This can be explained by the fact that the complexity of GM depends on the particular choice of $(L_0, L_1)$, while the complexity of NGM and PS-GM involves the optimal parameters $L_0, L_1$, as discussed in Section 4. Moreover, a closer initial estimate $\hat{R}$ of the true distance $R$ leads to faster convergence of NGM to a solution. In Fig. 2, we present an experiment studying the performance of GM with the stepsize rule (9) depending on the choice of $(L_0, L_1)$. For each choice of $L_1 \in \{1, 2, 4, 8, 16\}$ we set $L_0 = (\frac{p-2}{L_1})^{p-2}$, according to Example A.1. As expected from the theory (see the corresponding discussion at the end of Section 4), the choice of the $(L_0, L_1)$ pair is crucial in practice for the performance of GM and depends on the target accuracy $\epsilon$. In Fig. 3, we conduct an experiment for accelerated methods and consider GM with stepsize (9), Algorithm 1 with $T(\cdot)$ being the gradient update with stepsize (9), and two variants of the normalized Similar Triangles Method (STM and STM-Max) from Gorbunov et al. (2024). STM uses normalization by the norm of the gradient at the current point in a gradient step, while STM-Max normalizes by the largest gradient norm over the optimization trajectory. It is worth noting that only STM-Max has theoretical convergence guarantees. We set $L_1 = 1$, $L_0 = (\frac{p-2}{L_1})^{p-2}$ (see Example A.1) with various $p$. We observe that for smaller values of $p$, Algorithm 1 outperforms STM and STM-Max. |
| Researcher Affiliation | Academia | Daniil Vankov Arizona State University EMAIL Anton Rodomanov CISPA EMAIL Angelia Nedić Arizona State University EMAIL Lalitha Sankar Arizona State University EMAIL Sebastian U. Stich CISPA EMAIL |
| Pseudocode | Yes | Algorithm 1 AGMsDR. 1: Input: initial point $x_0 \in \mathbb{R}^d$, update rule $T(\cdot)$. 2: $v_0 = x_0$, $A_0 = 0$, $\zeta_0(x) = \frac{1}{2}\|x - v_0\|^2$. 3: for $k = 0, 1, \ldots$ do 4: $y_k = \arg\min_y \{f(y): y = v_k + \beta(x_k - v_k),\ \beta \in [0, 1]\}$. 5: $x_{k+1} = T(y_k)$, $M_k = \frac{\|\nabla f(y_k)\|^2}{2[f(y_k) - f(x_{k+1})]}$ $(> 0)$. 6: Find $a_{k+1} > 0$ from the equation $M_k a_{k+1}^2 = A_k + a_{k+1}$. Set $A_{k+1} = A_k + a_{k+1}$. 7: $v_{k+1} = \arg\min_{x \in \mathbb{R}^d}\{\zeta_{k+1}(x) := \zeta_k(x) + a_{k+1}[f(y_k) + \langle \nabla f(y_k), x - y_k \rangle]\}$. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code, nor does it provide links to code repositories. |
| Open Datasets | No | The paper conducts numerical experiments using a synthetic function f(x) = 1/p |x|^p, as described in Appendix G, rather than external, publicly available datasets. |
| Dataset Splits | No | The paper does not use external datasets, but a synthetic function for numerical experiments. Therefore, there are no dataset splits described. |
| Hardware Specification | No | The paper discusses numerical results in Appendix G but does not provide any specific details about the hardware used to run the experiments. |
| Software Dependencies | No | The paper discusses numerical results in Appendix G but does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | G NUMERICAL RESULTS: In Fig. 1, we compare the performance of the analyzed methods for solving optimization problem (1) with the function $f(x) = \frac{1}{p}\|x\|^p$. We fix $L_1 = 1$ and choose $L_0 = (\frac{p-2}{L_1})^{p-2}$ according to Example A.1. For GM, we choose stepsizes according to (9), (12) and (13). For NGM, we use time-varying coefficients $\beta_k = \frac{\hat{R}}{k+1}$ with different values of $\hat{R} \in \{\frac{1}{2}R, 2R, 10R\}$, which allows us to study the robustness of this method to our initial guess of the unknown initial distance to the solution. Note that, for this particular problem, the choice $\hat{R} = R$ is rather special and allows the method to find the exact solution after one iteration, so we do not consider it. We observe that NGM and PS-GM outperform GM with stepsizes from (9), (12) and (13). This can be explained by the fact that the complexity of GM depends on the particular choice of $(L_0, L_1)$, while the complexity of NGM and PS-GM involves the optimal parameters $L_0, L_1$, as discussed in Section 4. Moreover, a closer initial estimate $\hat{R}$ of the true distance $R$ leads to faster convergence of NGM to a solution. In Fig. 2, we present an experiment studying the performance of GM with the stepsize rule (9) depending on the choice of $(L_0, L_1)$. For each choice of $L_1 \in \{1, 2, 4, 8, 16\}$ we set $L_0 = (\frac{p-2}{L_1})^{p-2}$, according to Example A.1. As expected from the theory (see the corresponding discussion at the end of Section 4), the choice of the $(L_0, L_1)$ pair is crucial in practice for the performance of GM and depends on the target accuracy $\epsilon$. In Fig. 3, we conduct an experiment for accelerated methods and consider GM with stepsize (9), Algorithm 1 with $T(\cdot)$ being the gradient update with stepsize (9), and two variants of the normalized Similar Triangles Method (STM and STM-Max) from Gorbunov et al. (2024). STM uses normalization by the norm of the gradient at the current point in a gradient step, while STM-Max normalizes by the largest gradient norm over the optimization trajectory. It is worth noting that only STM-Max has theoretical convergence guarantees. We set $L_1 = 1$, $L_0 = (\frac{p-2}{L_1})^{p-2}$ (see Example A.1) with various $p$. We observe that for smaller values of $p$, Algorithm 1 outperforms STM and STM-Max. |
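The quoted setup is easy to reproduce on the one-dimensional synthetic problem. Below is a minimal sketch of the normalized gradient method (NGM) on $f(x) = \frac{1}{p}|x|^p$, assuming (as reconstructed from the quoted Appendix G) the time-varying coefficients $\beta_k = \hat{R}/(k+1)$ and minimizer $x^\star = 0$, so the true initial distance is $R = |x_0|$. The function names `f`, `grad`, and `ngm` are illustrative, not from the paper.

```python
import math

def f(x, p):
    """Synthetic objective f(x) = (1/p)|x|^p from Appendix G."""
    return abs(x) ** p / p

def grad(x, p):
    """Gradient of f: sign(x) * |x|^(p-1)."""
    return math.copysign(abs(x) ** (p - 1), x)

def ngm(x0, p, R_hat, iters):
    """Normalized gradient method with beta_k = R_hat / (k + 1)."""
    x = x0
    for k in range(iters):
        g = grad(x, p)
        if g == 0:  # exact minimizer reached
            break
        x = x - (R_hat / (k + 1)) * g / abs(g)  # unit-length direction
    return x

if __name__ == "__main__":
    x0, p = 1.0, 4
    R = abs(x0)  # true distance to the minimizer x* = 0
    # Mirror the paper's robustness study over guesses of R_hat.
    for R_hat in (0.5 * R, 2 * R, 10 * R):
        xK = ngm(x0, p, R_hat, 200)
        print(f"R_hat = {R_hat:4.1f}: f(x_K) = {f(xK, p):.3e}")
```

Note the special case mentioned in the quote: with $\hat{R} = R$ the very first step has length exactly $R$ and lands on the minimizer, which is why that guess is excluded from the comparison.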