Implicit Differentiation for Fast Hyperparameter Selection in Non-Smooth Convex Learning

Authors: Quentin Bertrand, Quentin Klopfenstein, Mathurin Massias, Mathieu Blondel, Samuel Vaiter, Alexandre Gramfort, Joseph Salmon

JMLR 2022

Reproducibility Variable Result LLM Response
Research Type Experimental We provide extensive experiments on diverse data sets and estimators (Section 4). We first show that implicit differentiation significantly outperforms other hypergradient methods (Section 4.1). Then, leveraging sparsity, we illustrate computational benefits of first-order optimization with respect to zero-order techniques for solving Problem (2) on Lasso, elastic net and multiclass logistic regression (Section 4.2).
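As a rough illustration of the implicit-differentiation idea the paper evaluates, here is a pure-NumPy sketch for the Lasso (not the paper's sparse-ho implementation; function names are hypothetical, an ISTA inner solver stands in for the paper's solvers, and the 1/(2n) least-squares scaling is assumed). On the support S of the Lasso solution, the first-order conditions give a closed-form Jacobian, so the hypergradient only needs one small linear system:

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def lasso_ista(X, y, lam, n_iter=5000):
    """Solve min_beta ||y - X beta||^2 / (2 n) + lam * ||beta||_1 with ISTA."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n  # Lipschitz constant of the smooth part
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        beta = soft_threshold(beta - grad / L, lam / L)
    return beta

def lasso_hypergradient(X, y, X_val, y_val, lam):
    """Hypergradient of the held-out MSE w.r.t. lam via implicit differentiation.

    On the support S, beta_S = (X_S^T X_S)^{-1} (X_S^T y - n lam sign(beta_S)),
    hence d beta_S / d lam = -n (X_S^T X_S)^{-1} sign(beta_S).
    """
    n = X.shape[0]
    beta = lasso_ista(X, y, lam)
    S = beta != 0
    X_S = X[:, S]
    jac_S = -n * np.linalg.solve(X_S.T @ X_S, np.sign(beta[S]))
    # gradient of the outer criterion C(beta) = ||y_val - X_val beta||^2 / (2 n_val)
    grad_C_S = X_val[:, S].T @ (X_val @ beta - y_val) / X_val.shape[0]
    return jac_S @ grad_C_S
```

The sketch can be sanity-checked against a finite difference of the validation loss.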
Researcher Affiliation Collaboration Quentin Bertrand, Université Paris-Saclay, Inria, CEA, Palaiseau, France; Quentin Klopfenstein, Institut Mathématique de Bourgogne, Université de Bourgogne, Dijon, France; Mathurin Massias, MaLGa, DIBRIS, Università degli Studi di Genova, Genova, Italy; Mathieu Blondel, Google Research, Brain team, Paris, France; Samuel Vaiter, CNRS and Institut Mathématique de Bourgogne, Université de Bourgogne, Dijon, France; Alexandre Gramfort, Université Paris-Saclay, Inria, CEA, Palaiseau, France; Joseph Salmon, IMAG, Université de Montpellier, CNRS, Montpellier, France
Pseudocode Yes Algorithm 1 Forward-mode PGD... Algorithm 2 Reverse-mode PGD... Algorithm 3 Forward-mode PCD... Algorithm 4 Reverse-mode PCD... Algorithm 5 Implicit differentiation... Algorithm 6 Gradient descent with approximate gradient...
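The listed algorithms differentiate the inner solver itself. A minimal sketch in the spirit of forward-mode PGD (Algorithm 1), assuming a Lasso inner problem with the 1/(2n) least-squares scaling and hypothetical names: the iterate beta and its derivative dbeta with respect to lam are propagated jointly through each proximal gradient step.

```python
import numpy as np

def forward_mode_pgd(X, y, lam, n_iter=2000):
    """Sketch of forward-mode differentiation of proximal gradient descent
    for the Lasso: jointly iterate on beta and dbeta = d beta / d lam."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n  # step size 1/L
    beta, dbeta = np.zeros(p), np.zeros(p)
    for _ in range(n_iter):
        z = beta - X.T @ (X @ beta - y) / (n * L)
        dz = dbeta - X.T @ (X @ dbeta) / (n * L)
        tau = lam / L
        active = np.abs(z) > tau  # where the soft-threshold is differentiable
        beta = np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)
        # chain rule through soft_threshold(z, tau): d/dz = 1 on the active
        # set, d/dtau = -sign(z); here d tau / d lam = 1 / L
        dbeta = active * (dz - np.sign(z) / L)
    return beta, dbeta
```

At convergence dbeta agrees, on the support, with the closed-form implicit-differentiation Jacobian, and is exactly zero off the support.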
Open Source Code Yes We release our implementation as a high-quality, documented and tested Python package: https://github.com/qb3/sparse-ho.
Open Datasets Yes Table 5: Characteristics of the data sets used for the experiments. ... Data available on the libsvm website: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
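The libsvm datasets referenced above are distributed in the svmlight text format ("label index:value index:value ..."). A minimal, hypothetical parser sketch for illustration (dense output, 1-based indices as in the libsvm files; in practice one would use an existing loader such as scikit-learn's load_svmlight_file):

```python
import numpy as np

def load_libsvm(path, n_features):
    """Parse a small svmlight/libsvm text file into a dense (X, y) pair."""
    labels, rows = [], []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue
            labels.append(float(parts[0]))
            row = np.zeros(n_features)
            for tok in parts[1:]:
                idx, val = tok.split(":")
                row[int(idx) - 1] = float(val)  # libsvm indices are 1-based
            rows.append(row)
    return np.array(rows), np.array(labels)
```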
Dataset Splits Yes In the Lasso and elastic net experiments, we pick a K-fold CV loss as outer criterion (in our experiments the default choice is K = 5). Hence, the data set (X, y) is partitioned into K hold-out data sets (X_train^k, y_train^k), (X_val^k, y_val^k). ... With a single train/test split, the bilevel problem to solve writes: arg min_{λ = (λ_1, ..., λ_q) ∈ R^q} C(β̂(λ_1), ..., β̂(λ_q); X_test, y_test) s.t. β̂(λ_k) ∈ arg min_{β ∈ R^p} ψ_k(β, λ_k; X_train, y_train), ∀ k ∈ [q].
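The K-fold outer criterion described above can be sketched as follows (pure NumPy, an ISTA inner solver, and hypothetical names; not the paper's implementation). Each fold trains a Lasso on the remaining folds and evaluates the held-out mean squared error, and the outer criterion averages over folds:

```python
import numpy as np

def soft_threshold(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def lasso_ista(X, y, lam, n_iter=2000):
    """Solve min_beta ||y - X beta||^2 / (2 n) + lam * ||beta||_1 with ISTA."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        beta = soft_threshold(beta - grad / L, lam / L)
    return beta

def kfold_cv_loss(X, y, lam, K=5):
    """Outer criterion L(lam): average held-out MSE over K folds
    (K = 5 being the default choice in the experiments)."""
    n = X.shape[0]
    folds = np.array_split(np.arange(n), K)
    losses = []
    for k in range(K):
        val = folds[k]
        train = np.setdiff1d(np.arange(n), val)
        beta = lasso_ista(X[train], y[train], lam)
        resid = y[val] - X[val] @ beta
        losses.append(resid @ resid / (2 * len(val)))
    return np.mean(losses)
```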
Hardware Specification No Our package, sparse-ho, is implemented in Python. It relies on NumPy (Harris et al., 2020), Numba (Lam et al., 2015) and SciPy (Virtanen et al., 2020).
Software Dependencies No Our package, sparse-ho, is implemented in Python. It relies on NumPy (Harris et al., 2020), Numba (Lam et al., 2015) and SciPy (Virtanen et al., 2020). ... We used the hyperopt implementation (Bergstra et al., 2013). ... we used Celer (Massias et al., 2020) for the Lasso, and Lightning (Blondel and Pedregosa, 2016) for the SVM.
Experiment Setup Yes In order to aim for good numerical precision, problems were solved up to a duality gap of 10^-6 for the forward-mode and the reverse-mode. ... we used the default tolerance of 10^-4. ... For the Lasso we chose a grid of 100 hyperparameters λ, uniformly spaced between λ_max − ln(10^4) and λ_max. ... For the elastic net we chose for each of the two hyperparameters a grid of 10 values uniformly spaced between λ_max and λ_max − ln(10^4). ... Random-search: we chose 30 values of λ sampled uniformly between λ_max and λ_max − ln(10^4) for each hyperparameter. ... SMBO: ... It evaluates L using 5 values of λ, chosen uniformly at random between λ_max and λ_max − ln(10^4). ... 1st order: first-order method with exact gradient (Algorithm 6 with constant tolerances ϵ_i = 10^-6), with λ_max − ln(10^2) as a starting point. 1st order approx: a first-order method using approximate gradient (Algorithm 6 with tolerances ϵ_i, geometrically decreasing from 10^-2 to 10^-6), with λ_max − ln(10^2) as a starting point.
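The λ grids described above are uniform in log(λ); in terms of the raw regularization strength this is a geometric grid over [λ_max / 10^4, λ_max]. A minimal sketch (hypothetical names, assuming the 1/(2n) least-squares scaling, for which λ_max = ||X^T y||_∞ / n is the smallest value making the Lasso solution exactly zero):

```python
import numpy as np

def lasso_lambda_max(X, y):
    """Smallest regularization for which the Lasso solution is 0:
    lambda_max = ||X^T y||_inf / n (for the 1/(2n) least-squares scaling)."""
    return np.linalg.norm(X.T @ y, np.inf) / X.shape[0]

def lasso_grid(X, y, n_lambdas=100, span=1e4):
    """Geometric grid of n_lambdas values over [lambda_max / span, lambda_max],
    i.e. uniform in log(lambda) between log(lambda_max) - ln(span)
    and log(lambda_max)."""
    lam_max = lasso_lambda_max(X, y)
    return np.geomspace(lam_max / span, lam_max, n_lambdas)
```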