A Generalization Result for Convergence in Learning-to-Optimize
Authors: Michael Sucker, Peter Ochs
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct two experiments to validate our claims: we use a neural-network-based iterative optimization algorithm (a) to solve quadratic problems and (b) to train another neural network. In both cases, the learned algorithm outperforms the baseline and converges to a critical point with high probability. |
| Researcher Affiliation | Academia | ¹Department of Mathematics, University of Tübingen, Tübingen, Germany; ²Department of Mathematics and Computer Science, Saarland University, Saarbrücken, Germany. Correspondence to: Michael Sucker <EMAIL>, Peter Ochs <EMAIL>. |
| Pseudocode | No | The paper describes algorithmic updates in text and through network architecture diagrams (Figure 3 and 4) but does not contain structured pseudocode or algorithm blocks with labeled steps. |
| Open Source Code | Yes | The code to reproduce the results can be found at https://github.com/MichiSucker/COLA_2024. |
| Open Datasets | No | For the quadratic problems, the strong-convexity and smoothness constants of ℓ are sampled randomly in the intervals [m₋, m₊], [L₋, L₊] ⊂ (0, +∞), and the matrix A_j, j = 1, ..., N, is defined as a diagonal matrix with entries a^j_{ii} = m_j + i(L_j − m_j)/d, i = 1, ..., d. For training neural networks, the functions g_i are polynomials of degree d = 5, whose coefficients (c_{i,0}, ..., c_{i,5}) are sampled uniformly in [−5, 5]. Similarly, the points {x_{i,j}}_{j=1}^K are sampled uniformly in [−2, 2]. All datasets are generated synthetically according to specified rules rather than being publicly available external datasets. |
| Dataset Splits | Yes | For the construction of the prior, we use N_prior = 500 functions; for the probabilistic constraint, we use N_val = 500 functions; and for the PAC-Bayesian optimization step, we use N_train = 250 functions, all sampled i.i.d., so that the data sets are independent of each other. The right plot shows the estimated probabilities ... on 250 test sets of size N = 250. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU/CPU models, memory) used to run its experiments. It only mentions general experimental procedures. |
| Software Dependencies | No | As a baseline we use Adam (Kingma & Ba, 2015) as implemented in PyTorch (Paszke et al., 2019). Although PyTorch is mentioned, no specific version number is provided, which is necessary for reproducibility. |
| Experiment Setup | Yes | Thus, we restrict to t_train = 500 iterations. Finally, we consider ξ to be converged if the loss is smaller than 10⁻¹⁶. Adam (Kingma & Ba, 2015)... we tune its step-size with a simple grid search over 100 values in [10⁻⁴, 10⁻²], such that its performance is best for the given t_train = 250 iterations. This yields the value κ = 0.008. We run this procedure for 150·10³ iterations. |
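The synthetic quadratic problems described in the Open Datasets row can be sketched as follows. This is a minimal illustration, not the authors' code: the concrete interval bounds (`m_minus`, `m_plus`, `L_minus`, `L_plus`), the dimension `d`, and the helper name `sample_quadratic` are assumptions, and the diagonal formula `a_ii = m_j + i*(L_j - m_j)/d` is one plausible reading of the (garbled) formula in the paper excerpt.

```python
import numpy as np

# Placeholder interval bounds -- the paper samples m_j from [m-, m+] and
# L_j from [L-, L+] in (0, +inf) but the excerpt does not give the values.
m_minus, m_plus = 0.05, 0.5   # strong-convexity range (assumption)
L_minus, L_plus = 1.0, 10.0   # smoothness range (assumption)
d = 50                        # problem dimension (assumption)


def sample_quadratic(rng, d):
    """Sample one diagonal quadratic problem with controlled spectrum.

    Returns a diagonal matrix A_j whose entries interpolate linearly
    between roughly m_j and exactly L_j, so the associated quadratic
    0.5 * x^T A_j x is m_j-strongly convex and L_j-smooth.
    """
    m_j = rng.uniform(m_minus, m_plus)
    L_j = rng.uniform(L_minus, L_plus)
    # a_ii = m_j + i * (L_j - m_j) / d for i = 1, ..., d
    diag = m_j + np.arange(1, d + 1) * (L_j - m_j) / d
    return np.diag(diag), m_j, L_j


rng = np.random.default_rng(0)
A, m_j, L_j = sample_quadratic(rng, d)
# The spectrum stays inside (m_j, L_j]: strong convexity and smoothness hold.
assert np.all(np.linalg.eigvalsh(A) > m_j)
assert np.all(np.linalg.eigvalsh(A) <= L_j)
```

Sampling the constants per problem (rather than fixing them) is what makes the learned optimizer face a distribution over conditioning levels, which is the setting the paper's PAC-Bayesian analysis addresses.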