Toward a Unified Theory of Gradient Descent under Generalized Smoothness
Authors: Alexander Tyurin
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We verify our theoretical results by asking whether it is necessary to use the step size rule from Algorithm 1, or whether the step size rules of Li et al. (2024a) and Vankov et al. (2024) suffice instead. We first take the function f : [0, 0.1] → ℝ ∪ {+∞} defined as f(x) = −log x − log(0.1 − x) (see Figure 2), which has its minimum at x = 0.05. This function is (ρ, 800, 2)-smooth with ρ = 2; however, it is not (L₀, L₁)-smooth for any L₀, L₁ ≥ 0. Consequently, we run Algorithm 1 with ℓ(s) = 800 + 2s², starting at x₀ = 10⁻⁷, and observe that it converges⁴ after 75 iterations. Next, we take the step size γₖ = 1/(800 + 2(2f′(x₀))²) from (Li et al., 2024a) and observe that GD requires at least 20,000 iterations because f′(x₀) is huge. Finally, to verify whether the exponent 2 is necessary, we take ℓ(s) = 800 + 2s (Vankov et al., 2024). For this choice of ℓ, GD diverges. |
| Researcher Affiliation | Academia | ¹AIRI, Moscow, Russia; ²Skoltech, Moscow, Russia. Correspondence to: Alexander Tyurin <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Gradient Descent (GD) with ℓ-Smoothness |
| Open Source Code | No | The paper does not provide an explicit statement of code release, a link to a repository, or mention of code in supplementary materials for the described methodology. Figure 1 shows a Python function for step size computation, but it's a code snippet within the paper, not a release statement for the entire work. |
| Open Datasets | No | The paper uses synthetic mathematical functions (e.g., f(x) = −log x − log(0.1 − x) and f(x) = eˣ + e¹⁻ˣ) for its experiments, rather than external datasets, and thus does not provide access information for any public datasets. |
| Dataset Splits | No | The paper's experiments are conducted on synthetic mathematical functions, not datasets, so there are no dataset splits to describe. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | Yes | Figure 1: Python function to compute the step sizes {γk} using SciPy (Virtanen et al., 2020). |
| Experiment Setup | Yes | We first take the function f : [0, 0.1] → ℝ ∪ {+∞} defined as f(x) = −log x − log(0.1 − x) (see Figure 2), which has its minimum at x = 0.05. This function is (ρ, 800, 2)-smooth with ρ = 2; however, it is not (L₀, L₁)-smooth for any L₀, L₁ ≥ 0. Consequently, we run Algorithm 1 with ℓ(s) = 800 + 2s², starting at x₀ = 10⁻⁷, and observe that it converges⁴ after 75 iterations. Next, we take the step size γₖ = 1/(800 + 2(2f′(x₀))²) from (Li et al., 2024a) and observe that GD requires at least 20,000 iterations because f′(x₀) is huge. Finally, to verify whether the exponent 2 is necessary, we take ℓ(s) = 800 + 2s (Vankov et al., 2024). For this choice of ℓ, GD diverges. ⁴Finds x̂ such that f(x̂) − f(x*) ≤ 10⁻⁵. |
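The experiment described in the table can be sketched in a few lines of Python. The objective f(x) = −log x − log(0.1 − x), the bound ℓ(s) = 800 + 2s², and the starting point x₀ = 10⁻⁷ are taken from the text; the step-size rule γₖ = 1/ℓ(2|f′(xₖ)|) used below is an illustrative assumption, not necessarily the paper's Algorithm 1 rule, which is computed by a SciPy-based routine (Figure 1) and may differ.

```python
import math

def f(x):
    # Objective from the paper's experiment; convex on (0, 0.1),
    # minimized at x = 0.05.
    return -math.log(x) - math.log(0.1 - x)

def df(x):
    # Derivative of f; it blows up near the boundary, so f is not
    # (L0, L1)-smooth, but it is ℓ-smooth for ℓ(s) = 800 + 2s².
    return -1.0 / x + 1.0 / (0.1 - x)

def ell(s):
    # The generalized-smoothness bound from the text (ρ = 2 case).
    return 800.0 + 2.0 * s ** 2

def gd(x0, n_iters):
    # Gradient descent with the illustrative adaptive step size
    # γ_k = 1/ℓ(2|f'(x_k)|); an assumption standing in for the
    # paper's actual step-size computation.
    x = x0
    for _ in range(n_iters):
        g = df(x)
        gamma = 1.0 / ell(2.0 * abs(g))
        x = x - gamma * g
    return x

x_hat = gd(x0=1e-7, n_iters=1000)
print(x_hat)  # close to the minimizer 0.05
```

Note the contrast the paper draws: a step size frozen at γₖ = 1/ℓ(2f′(x₀)) stays tiny forever because |f′(x₀)| ≈ 10⁷ at x₀ = 10⁻⁷, whereas re-evaluating ℓ at the current gradient lets the step grow as the iterate moves away from the boundary.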