Provable and Practical Online Learning Rate Adaptation with Hypergradient Descent
Authors: Ya-Chi Chu, Wenzhi Gao, Yinyu Ye, Madeleine Udell
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on deterministic convex problems show HDM with heavy-ball momentum (HDM-HB) exhibits robust performance and significantly outperforms other adaptive first-order methods. Moreover, HDM-HB often matches the performance of L-BFGS, an efficient and practical quasi-Newton method, using less memory and cheaper iterations. This section conducts numerical experiments to validate the empirical performance of hypergradient descent. We compare HDM-Best (see Section 5.1 below) with different adaptive optimization algorithms. |
| Researcher Affiliation | Academia | 1Department of Mathematics, Stanford University 2ICME, Stanford University 3Shanghai Jiao Tong University and Shanghai Institute for Mathematics and Interdisciplinary Sciences 4Management Science and Engineering, Stanford University. Correspondence to: Ya-Chi Chu <EMAIL>, Wenzhi Gao <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Hypergradient Descent Method (HDM): input initial point x1, P1 ∈ P (not necessarily PSD); for k = 1, 2, ... do ... Algorithm 3 HDM-Best: input starting point x0 = x1, P = S^n_+ ∩ D, B = [0, 0.9995], initial diagonal preconditioner P1 ∈ S^n_+ ∩ D, initial scalar momentum parameter β1 = 0.95, AdaGrad stepsizes ηp, ηb > 0, AdaGrad diagonal matrix U1 = 0, AdaGrad momentum scalar v1 = 0, τ > 0; for k = 1, 2, ... do ... |
| Open Source Code | Yes | This section highlights the major components of our most competitive variant HDM-Best. The algorithm and a more detailed explanation are available in Appendix A. The implementation is available at https://github.com/udellgroup/hypergrad. |
| Open Datasets | Yes | We test HDM-Best on deterministic convex problems. We adopt two convex optimization tasks in machine learning: support vector machine (Lee and Mangasarian, 2001) and logistic regression (Hastie, 2009). The testing datasets are obtained from LIBSVM (Chang and Lin, 2011). |
| Dataset Splits | No | The paper uses datasets obtained from LIBSVM but does not explicitly provide details about training/test/validation splits, exact percentages, or sample counts for reproduction. The 'Testing Configurations' section specifies maximum oracle access, initial point generation, and stopping criteria, but not dataset partitioning. |
| Hardware Specification | No | The paper does not provide specific hardware details such as CPU/GPU models, processor types, or memory amounts used for running experiments. Mentions of 'memory size' refer to algorithmic memory, not hardware. |
| Software Dependencies | No | The paper mentions using 'scipy' for BFGS and L-BFGS-Mk in the experimental setup, but it does not specify a version number for 'scipy' or any other software libraries used, which is necessary for reproducibility. |
| Experiment Setup | Yes | For HDM-Best, we search for the optimal ηp within {0.1/L, 1/L, 10/L, 100/L} and ηb within {1, 3, 5, 10, 100}. Stepsizes in GD, GD-HB, AGD-CVX, and AGD-SCVX are all set to 1/L. The Adam stepsize is chosen within the set {1/L, 10^-3, 10^-2, 10^-1, 1, 10}, with β1 = 0.9, β2 = 0.999. Maximum oracle access: we allow a maximum of 1000 gradient oracles for each algorithm. Initial point: all the algorithms are initialized from the same starting point, generated from the normal distribution N(0, In) and normalized to have unit length. Stopping criterion: algorithms stop if f(xk) − f* ≤ 10^-4. |
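The hypergradient mechanism behind the pseudocode quoted above can be illustrated with a minimal sketch. This is not the paper's HDM-Best (which adapts a diagonal preconditioner and a momentum parameter via AdaGrad-style online updates); it is the basic scalar-stepsize hypergradient update that HDM builds on, with the function `hypergradient_descent`, its parameter names, and the quadratic test problem all being illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def hypergradient_descent(grad, x0, alpha0=1e-3, beta=1e-6, iters=100):
    """Scalar-stepsize hypergradient descent sketch (assumed interface).

    For plain gradient descent x_{k+1} = x_k - alpha * grad(x_k), the
    derivative of f(x_k) with respect to alpha is
    -<grad(x_k), grad(x_{k-1})>, so the stepsize itself is updated by
    an online gradient step: alpha grows while successive gradients
    align and shrinks when they point in opposing directions.
    """
    x, alpha = np.asarray(x0, dtype=float), alpha0
    g_prev = np.zeros_like(x)
    for _ in range(iters):
        g = grad(x)
        # hypergradient step on the stepsize itself
        alpha += beta * (g @ g_prev)
        x = x - alpha * g
        g_prev = g
    return x, alpha
```

For example, on the quadratic f(x) = ||x||^2 / 2 (so grad(x) = x), successive gradients stay aligned, the stepsize increases from its initial value, and the iterates converge to the minimizer at the origin.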