A Generalization Result for Convergence in Learning-to-Optimize

Authors: Michael Sucker, Peter Ochs

ICML 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | "We conduct two experiments to show the validity of our claims: we use a neural-network-based iterative optimization algorithm to a) solve quadratic problems and b) train another neural network. In both cases, the learned algorithm outperforms the baseline and converges to a critical point with high probability."
Researcher Affiliation | Academia | "¹Department of Mathematics, University of Tübingen, Tübingen, Germany. ²Department of Mathematics and Computer Science, Saarland University, Saarbrücken, Germany. Correspondence to: Michael Sucker <EMAIL>, Peter Ochs <EMAIL>."
Pseudocode | No | The paper describes algorithmic updates in text and through network-architecture diagrams (Figures 3 and 4) but does not contain structured pseudocode or algorithm blocks with labeled steps.
Open Source Code | Yes | "The code to reproduce the results can be found at https://github.com/MichiSucker/COLA_2024."
Open Datasets | No | "For the quadratic problems, the strong-convexity and smoothness constants of ℓ are sampled randomly in the intervals [m₋, m₊], [L₋, L₊] ⊂ (0, +∞), and we define the matrix A_j, j = 1, ..., N, as a diagonal matrix with entries a^j_{ii} = m_j + i(√L_j − m_j)/d, i = 1, ..., d. For training neural networks, we use functions g_i as polynomials of degree d = 5, where we sample the coefficients (c_{i,0}, ..., c_{i,5}) uniformly in [−5, 5]. Similarly, we sample the points {x_{i,j}}_{j=1}^{K} uniformly in [−2, 2]." All datasets are generated synthetically according to these rules rather than drawn from publicly available external datasets.
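The synthetic data generation quoted above can be sketched as follows. This is a hedged illustration, not the paper's code: the dimension `d`, the sampling intervals for m and L, and the linear interpolation used for the diagonal entries are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_quadratic(d=10, m_range=(0.1, 1.0), L_range=(2.0, 10.0)):
    # Draw a strong-convexity constant m and a smoothness constant L,
    # then build a diagonal matrix whose entries increase from roughly
    # m to L. (A plain linear interpolation is an assumption here.)
    m = rng.uniform(*m_range)
    L = rng.uniform(*L_range)
    i = np.arange(1, d + 1)
    A = np.diag(m + i * (L - m) / d)
    b = rng.standard_normal(d)
    return A, b  # defines the quadratic loss 0.5 * ||A x - b||^2

def sample_polynomial_problem(K=50, degree=5):
    # Degree-5 polynomial with coefficients (c_0, ..., c_5) uniform in
    # [-5, 5], evaluated at K points drawn uniformly from [-2, 2].
    coeffs = rng.uniform(-5.0, 5.0, size=degree + 1)
    x = rng.uniform(-2.0, 2.0, size=K)
    y = np.polyval(coeffs[::-1], x)  # polyval expects highest degree first
    return x, y
```

Each call produces one independent training instance, so repeating the calls yields the i.i.d. problem collections used throughout the experiments.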
Dataset Splits | Yes | "For the construction of the prior, we use N_prior = 500 functions, for the probabilistic constraint we use N_val = 500 functions, and for the PAC-Bayesian optimization step we use N_train = 250 functions, all of which are sampled i.i.d., that is, the data sets are independent of each other. The right plot shows the estimated probabilities ... on 250 test sets of size N = 250."
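The split described above can be sketched as three independently sampled collections; the problem sampler below is a hypothetical placeholder, since any of the synthetic problem generators would slot in:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_problem():
    # Placeholder for one randomly drawn optimization problem
    # (e.g., a quadratic instance); here just a random parameter vector.
    return rng.standard_normal(5)

# Three data sets, sampled i.i.d. and independently of each other,
# matching the sizes quoted above.
prior_set = [sample_problem() for _ in range(500)]  # N_prior = 500
val_set   = [sample_problem() for _ in range(500)]  # N_val   = 500
train_set = [sample_problem() for _ in range(250)]  # N_train = 250
```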
Hardware Specification | No | The paper does not describe the hardware (e.g., GPU/CPU models, memory) used to run its experiments; it only mentions general experimental procedures.
Software Dependencies | No | "As baseline we use Adam (Kingma & Ba, 2015) as it is implemented in PyTorch (Paszke et al., 2019)." Although PyTorch is mentioned, no version number is provided, which is necessary for reproducibility.
Experiment Setup | Yes | "Thus, we restrict to t_train = 500 iterations. Finally, we consider ξ to be converged if the loss is smaller than 10⁻¹⁶. ... Adam (Kingma & Ba, 2015) ... we tune its step-size with a simple grid search over 100 values in [10⁻⁴, 10⁻²], such that its performance is best for the given t_train = 250 iterations. This yields the value κ = 0.008. We run this procedure for 150 · 10³ iterations."
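The step-size tuning quoted above can be sketched with a grid search around a minimal NumPy Adam implementation. The test function, iteration counts, and grid spacing below are illustrative assumptions; the paper itself uses PyTorch's Adam.

```python
import numpy as np

def adam_minimize(grad, x0, lr, n_steps=250,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    # Standard Adam update (Kingma & Ba, 2015) with bias correction.
    x = x0.astype(float).copy()
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for t in range(1, n_steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return x

def tune_step_size(f, grad, x0, n_grid=100):
    # Grid search over n_grid step sizes in [1e-4, 1e-2] (log-spaced
    # here as an assumption), keeping the one with the best final loss
    # after n_steps Adam iterations.
    lrs = np.logspace(-4, -2, n_grid)
    losses = [f(adam_minimize(grad, x0, lr)) for lr in lrs]
    best = int(np.argmin(losses))
    return lrs[best], losses[best]
```

On a simple quadratic such as f(x) = ½‖x‖², the selected step size stays within the searched interval and the corresponding final loss lies far below the initial loss, mirroring the role of the tuned value κ above.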