Automatic Differentiation of Optimization Algorithms with Time-Varying Updates

Authors: Sheheryar Mehmood, Peter Ochs

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To test our results, we provide numerical demonstrations on a few examples from classical Machine Learning. These include lasso regression, that is, ... We solve the three problems through PGD with four different choices of step sizes and APG with fixed step size and β_k := (k − 1)/(k + 5) (depicted by APG in Figure 1). ... In Figure 1, the top row shows the median error plots of the five algorithms and the bottom row shows the errors of the corresponding derivatives with the same colour.
Researcher Affiliation | Academia | Department of Mathematics & Computer Science, Saarland University, Saarbrücken, Germany. Correspondence to: Sheheryar Mehmood <EMAIL>.
Pseudocode | Yes | Algorithm 1 (Proximal Gradient with Extrapolation). Initialization: x^(0) = x^(−1) ∈ X, u ∈ U, 0 < α_min ≤ α_max < 2/L. Parameters: α_k ∈ [α_min, α_max] and β_k ∈ [0, 1] for all k ∈ ℕ. Update for k ≥ 0: y^(k) := (1 + β_k) x^(k) − β_k x^(k−1); w^(k) := y^(k) − α_k ∇_x f(y^(k), u); x^(k+1) := P_{α_k g}(w^(k), u).
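As a rough illustration (not the authors' released code), the sketch below unrolls Algorithm 1 in PyTorch for a lasso-type objective and differentiates the final iterate with respect to the parameter u by reverse-mode AD, which is the unrolling setting the paper analyses. The objective f(x, u) = ½‖Ax − b‖², the choice g = u‖x‖₁ with its soft-thresholding proximal map, the fixed 1/L step size, and all names are illustrative assumptions.

```python
import torch

def soft_threshold(w, tau):
    # Proximal map of tau * ||.||_1 (soft-thresholding); assumed form of P_{alpha_k g}.
    return torch.sign(w) * torch.clamp(w.abs() - tau, min=0.0)

def unrolled_pgd_extrapolation(A, b, u, num_iters=200):
    # Unrolled Algorithm 1 for min_x 0.5*||Ax - b||^2 + u*||x||_1,
    # kept differentiable with respect to the regularization weight u.
    L = torch.linalg.matrix_norm(A, ord=2) ** 2        # Lipschitz constant of grad f
    x_prev = torch.zeros(A.shape[1], dtype=A.dtype)
    x = torch.zeros(A.shape[1], dtype=A.dtype)
    for k in range(num_iters):
        alpha_k = 1.0 / L                              # one admissible step-size choice
        beta_k = max(0.0, (k - 1) / (k + 5))           # extrapolation, clipped to [0, 1]
        y = (1 + beta_k) * x - beta_k * x_prev         # extrapolation step
        w = y - alpha_k * (A.T @ (A @ y - b))          # gradient step on f(., u)
        x_prev, x = x, soft_threshold(w, alpha_k * u)  # proximal step
    return x

A = torch.randn(30, 50, dtype=torch.float64)
b = torch.randn(30, dtype=torch.float64)
u = torch.tensor(0.1, dtype=torch.float64, requires_grad=True)
x_K = unrolled_pgd_extrapolation(A, b, u)
x_K.sum().backward()                                   # reverse-mode AD through the loop
print(u.grad)                                          # derivative of sum(x^(K)) w.r.t. u
```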
Open Source Code | No | The paper mentions autograd libraries such as PyTorch, TensorFlow, and JAX as tools used, but does not provide access to the authors' own implementation of the described methodology.
Open Datasets | Yes | We solve (16) for 50 randomly generated datasets, (17) for 50 perturbed instances of the MADELON dataset (Dua & Graff, 2017), and (18) for a single instance of the CIFAR10 dataset (Krizhevsky, 2009).
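The datasets named above are publicly available; one way to fetch them is sketched below, assuming the OpenML copy of MADELON (via scikit-learn) and the torchvision CIFAR-10 loader, which may differ from the exact instances and perturbations the authors used. The synthetic shapes are placeholders.

```python
import numpy as np
from sklearn.datasets import fetch_openml
from torchvision.datasets import CIFAR10

rng = np.random.default_rng(0)
# Problem (16): 50 randomly generated datasets (shapes here are placeholders).
synthetic = [(rng.standard_normal((100, 200)), rng.standard_normal(100)) for _ in range(50)]

# Problem (17): MADELON (the paper quotes M = 2,000 samples and N = 501 features).
madelon = fetch_openml("madelon", version=1, as_frame=False)
X_mad, y_mad = madelon.data, madelon.target

# Problem (18): CIFAR-10 training set, 50,000 images with 32 x 32 x 3 features each.
cifar = CIFAR10(root="./data", train=True, download=True)
X_cif = np.asarray(cifar.data, dtype=np.float64).reshape(len(cifar.data), -1)
```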
Dataset Splits | No | For (17), we use the MADELON dataset with M = 2,000 samples and N = 501 features. ... For (18), we use the CIFAR10 dataset with M = 50,000 samples and N = 32 × 32 × 3 features. The paper specifies the total number of samples for these datasets but does not provide specific training/validation/test splits.
Hardware Specification | No | No specific hardware details (such as GPU/CPU models or memory) are provided in the paper for running the experiments.
Software Dependencies | No | A crucial advantage of AD is that it provides a nice blackbox implementation thanks to the powerful autograd libraries included in PyTorch (Paszke et al., 2019), TensorFlow (Abadi et al., 2016), and JAX (Bradbury et al., 2018). While these software packages are mentioned, no specific version numbers are provided for their usage in the experiments.
Experiment Setup | Yes | We solve the three problems through PGD with four different choices of step sizes and APG with fixed step size and β_k := (k − 1)/(k + 5) (depicted by APG in Figure 1). ... For each problem, we run PGD with four different choices of step size, namely, (i) α_k = 2/(L + m) for (17) and α_k = 1/L for (16), (ii) α_k ∈ U(0, 2/(3L)), (iii) α_k ∈ U(2/(3L), 4/(3L)), and (iv) α_k ∈ U(4/(3L), 2/L), for each k ∈ ℕ. We also run APG with α_k = 1/L and β_k = (k − 1)/(k + 5). Before starting each algorithm, we obtain w^(0) ∈ B_{10^(−2)}(w^∗) by partially solving each problem through APG.
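One plausible reading of these step-size choices is sketched below; the uniform-sampling interpretation of U(·, ·), the clipping of β_k at zero, and the function names are assumptions made for illustration, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def pgd_step_size(choice, L, m=0.0):
    # The four step-size schedules quoted above; L is the Lipschitz constant of
    # grad f and m a strong-convexity constant (used for problem (17) only).
    if choice == 1:
        return 2.0 / (L + m) if m > 0 else 1.0 / L              # fixed step size
    if choice == 2:
        return rng.uniform(0.0, 2.0 / (3.0 * L))                # alpha_k ~ U(0, 2/(3L))
    if choice == 3:
        return rng.uniform(2.0 / (3.0 * L), 4.0 / (3.0 * L))    # alpha_k ~ U(2/(3L), 4/(3L))
    if choice == 4:
        return rng.uniform(4.0 / (3.0 * L), 2.0 / L)            # alpha_k ~ U(4/(3L), 2/L)
    raise ValueError("choice must be in {1, 2, 3, 4}")

def apg_params(k, L):
    # APG: fixed step 1/L and extrapolation beta_k = (k - 1)/(k + 5), clipped at 0.
    return 1.0 / L, max(0.0, (k - 1) / (k + 5))
```

Under this reading, the warm start w^(0) ∈ B_{10^(−2)}(w^∗) would correspond to running APG until the iterate lies within distance 10^(−2) of the solution before switching to the schedule under study.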