Implicit Differentiation for Fast Hyperparameter Selection in Non-Smooth Convex Learning

Authors: Quentin Bertrand, Quentin Klopfenstein, Mathurin Massias, Mathieu Blondel, Samuel Vaiter, Alexandre Gramfort, Joseph Salmon

JMLR 2022

Reproducibility Variable Result LLM Response
Research Type Experimental We provide extensive experiments on diverse data sets and estimators (Section 4). We first show that implicit differentiation significantly outperforms other hypergradient methods (Section 4.1). Then, leveraging sparsity, we illustrate computational benefits of first-order optimization with respect to zero-order techniques for solving Problem (2) on Lasso, elastic net and multiclass logistic regression (Section 4.2).
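As a rough illustration of the implicit-differentiation idea the paper evaluates, here is a pure-NumPy sketch for the Lasso (not the paper's sparse-ho implementation; function names are hypothetical, an ISTA inner solver stands in for the paper's solvers, and the 1/(2n) least-squares scaling is assumed). On the support S of the Lasso solution, the first-order conditions give a closed-form Jacobian, so the hypergradient only needs one small linear system:

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def lasso_ista(X, y, lam, n_iter=5000):
    """Solve min_beta ||y - X beta||^2 / (2 n) + lam * ||beta||_1 with ISTA."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n  # Lipschitz constant of the smooth part
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        beta = soft_threshold(beta - grad / L, lam / L)
    return beta

def lasso_hypergradient(X, y, X_val, y_val, lam):
    """Hypergradient of the held-out MSE w.r.t. lam via implicit differentiation.

    On the support S, beta_S = (X_S^T X_S)^{-1} (X_S^T y - n lam sign(beta_S)),
    hence d beta_S / d lam = -n (X_S^T X_S)^{-1} sign(beta_S).
    """
    n = X.shape[0]
    beta = lasso_ista(X, y, lam)
    S = beta != 0
    X_S = X[:, S]
    jac_S = -n * np.linalg.solve(X_S.T @ X_S, np.sign(beta[S]))
    # gradient of the outer criterion C(beta) = ||y_val - X_val beta||^2 / (2 n_val)
    grad_C_S = X_val[:, S].T @ (X_val @ beta - y_val) / X_val.shape[0]
    return jac_S @ grad_C_S
```

The sketch can be sanity-checked against a finite difference of the validation loss.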
Researcher Affiliation Collaboration Quentin Bertrand, Université Paris-Saclay, Inria, CEA, Palaiseau, France; Quentin Klopfenstein, Institut Mathématique de Bourgogne, Université de Bourgogne, Dijon, France; Mathurin Massias, MaLGa, DIBRIS, Università degli Studi di Genova, Genova, Italy; Mathieu Blondel, Google Research, Brain team, Paris, France; Samuel Vaiter, CNRS and Institut Mathématique de Bourgogne, Université de Bourgogne, Dijon, France; Alexandre Gramfort, Université Paris-Saclay, Inria, CEA, Palaiseau, France; Joseph Salmon, IMAG, Université de Montpellier, CNRS, Montpellier, France
Pseudocode Yes Algorithm 1 Forward-mode PGD... Algorithm 2 Reverse-mode PGD... Algorithm 3 Forward-mode PCD... Algorithm 4 Reverse-mode PCD... Algorithm 5 Implicit differentiation... Algorithm 6 Gradient descent with approximate gradient...
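The listed algorithms differentiate the inner solver itself. A minimal sketch in the spirit of forward-mode PGD (Algorithm 1), assuming a Lasso inner problem with the 1/(2n) least-squares scaling and hypothetical names: the iterate beta and its derivative dbeta with respect to lam are propagated jointly through each proximal gradient step.

```python
import numpy as np

def forward_mode_pgd(X, y, lam, n_iter=2000):
    """Sketch of forward-mode differentiation of proximal gradient descent
    for the Lasso: jointly iterate on beta and dbeta = d beta / d lam."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n  # step size 1/L
    beta, dbeta = np.zeros(p), np.zeros(p)
    for _ in range(n_iter):
        z = beta - X.T @ (X @ beta - y) / (n * L)
        dz = dbeta - X.T @ (X @ dbeta) / (n * L)
        tau = lam / L
        active = np.abs(z) > tau  # where the soft-threshold is differentiable
        beta = np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)
        # chain rule through soft_threshold(z, tau): d/dz = 1 on the active
        # set, d/dtau = -sign(z); here d tau / d lam = 1 / L
        dbeta = active * (dz - np.sign(z) / L)
    return beta, dbeta
```

At convergence dbeta agrees, on the support, with the closed-form implicit-differentiation Jacobian, and is exactly zero off the support.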
Open Source Code Yes We release our implementation as a high-quality, documented and tested Python package: https://github.com/qb3/sparse-ho.
Open Datasets Yes Table 5: Characteristics of the data sets used for the experiments. ... Data available on the libsvm website: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
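The libsvm datasets referenced above are distributed in the svmlight text format ("label index:value index:value ..."). A minimal, hypothetical parser sketch for illustration (dense output, 1-based indices as in the libsvm files; in practice one would use an existing loader such as scikit-learn's load_svmlight_file):

```python
import numpy as np

def load_libsvm(path, n_features):
    """Parse a small svmlight/libsvm text file into a dense (X, y) pair."""
    labels, rows = [], []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue
            labels.append(float(parts[0]))
            row = np.zeros(n_features)
            for tok in parts[1:]:
                idx, val = tok.split(":")
                row[int(idx) - 1] = float(val)  # libsvm indices are 1-based
            rows.append(row)
    return np.array(rows), np.array(labels)
```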
Dataset Splits Yes In the Lasso and elastic net experiments, we pick a K-fold CV loss as outer criterion (in our experiments the default choice is K = 5). Hence, the data set (X, y) is partitioned into K hold-out data sets (X_train^k, y_train^k), (X_val^k, y_val^k). ... With a single train/test split, the bilevel problem to solve writes: arg min_{λ = (λ_1, ..., λ_q) ∈ R^q} C(β̂(λ_1), ..., β̂(λ_q); X_test, y_test) s.t. β̂(λ_k) ∈ arg min_{β ∈ R^p} ψ_k(β, λ_k; X_train, y_train), ∀ k ∈ [q].
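The K-fold outer criterion described above can be sketched as follows (pure NumPy, an ISTA inner solver, and hypothetical names; not the paper's implementation). Each fold trains a Lasso on the remaining folds and evaluates the held-out mean squared error, and the outer criterion averages over folds:

```python
import numpy as np

def soft_threshold(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def lasso_ista(X, y, lam, n_iter=2000):
    """Solve min_beta ||y - X beta||^2 / (2 n) + lam * ||beta||_1 with ISTA."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        beta = soft_threshold(beta - grad / L, lam / L)
    return beta

def kfold_cv_loss(X, y, lam, K=5):
    """Outer criterion L(lam): average held-out MSE over K folds
    (K = 5 being the default choice in the experiments)."""
    n = X.shape[0]
    folds = np.array_split(np.arange(n), K)
    losses = []
    for k in range(K):
        val = folds[k]
        train = np.setdiff1d(np.arange(n), val)
        beta = lasso_ista(X[train], y[train], lam)
        resid = y[val] - X[val] @ beta
        losses.append(resid @ resid / (2 * len(val)))
    return np.mean(losses)
```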
Hardware Specification No Our package, sparse-ho, is implemented in Python. It relies on NumPy (Harris et al., 2020), Numba (Lam et al., 2015) and SciPy (Virtanen et al., 2020).
Software Dependencies No Our package, sparse-ho, is implemented in Python. It relies on NumPy (Harris et al., 2020), Numba (Lam et al., 2015) and SciPy (Virtanen et al., 2020). ... We used the hyperopt implementation (Bergstra et al., 2013). ... we used Celer (Massias et al., 2020) for the Lasso, and Lightning (Blondel and Pedregosa, 2016) for the SVM.
Experiment Setup Yes In order to aim for good numerical precision, problems were solved up to a duality gap of 10^-6 for the forward-mode and the reverse-mode. ... we used the default tolerance of 10^-4. ... For the Lasso we chose a grid of 100 hyperparameters λ, uniformly spaced between λ_max − ln(10^4) and λ_max. ... For the elastic net we chose for each of the two hyperparameters a grid of 10 values uniformly spaced between λ_max and λ_max − ln(10^4). ... Random-search: we chose 30 values of λ sampled uniformly between λ_max and λ_max − ln(10^4) for each hyperparameter. ... SMBO: ... It evaluates L using 5 values of λ, chosen uniformly at random between λ_max and λ_max − ln(10^4). ... 1st order: first-order method with exact gradient (Algorithm 6 with constant tolerances ϵ_i = 10^-6), with λ_max − ln(10^2) as a starting point. 1st order approx: a first-order method using approximate gradient (Algorithm 6 with tolerances ϵ_i, geometrically decreasing from 10^-2 to 10^-6), with λ_max − ln(10^2) as a starting point.
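The λ grids described above are uniform in log(λ); in terms of the raw regularization strength this is a geometric grid over [λ_max / 10^4, λ_max]. A minimal sketch (hypothetical names, assuming the 1/(2n) least-squares scaling, for which λ_max = ||X^T y||_∞ / n is the smallest value making the Lasso solution exactly zero):

```python
import numpy as np

def lasso_lambda_max(X, y):
    """Smallest regularization for which the Lasso solution is 0:
    lambda_max = ||X^T y||_inf / n (for the 1/(2n) least-squares scaling)."""
    return np.linalg.norm(X.T @ y, np.inf) / X.shape[0]

def lasso_grid(X, y, n_lambdas=100, span=1e4):
    """Geometric grid of n_lambdas values over [lambda_max / span, lambda_max],
    i.e. uniform in log(lambda) between log(lambda_max) - ln(span)
    and log(lambda_max)."""
    lam_max = lasso_lambda_max(X, y)
    return np.geomspace(lam_max / span, lam_max, n_lambdas)
```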