Dropout Training is Distributionally Robust Optimal

Authors: José Blanchet, Yang Kang, José Luis Montiel Olea, Viet Anh Nguyen, Xuhui Zhang

JMLR 2023

Reproducibility Variable Result LLM Response
Research Type Experimental 7. Numerical Experiments We conduct numerical experiments in this section to compare our preferred implementation of dropout training to stochastic gradient descent, as well as our recommended selection of δ to cross-validation. The benefits of our suggested unbiased multi-level Monte Carlo algorithm are analyzed using high-dimensional regression, whereas our selection of δ is analyzed using a low-dimensional regression model. [...] Figure 1 shows the l2 divergence from the true β_n of the two algorithms for varying L, while Figure 2 and Figure 3 show the l∞ and l1 divergence, respectively. We provide supporting evidence in Appendix A.7 to argue for our choice of the learning rate, initialization, and wall-clock time, where our proposed algorithm is robust to any reasonable choice. [...] Table 1: Frequency of in-sample loss covering the true population loss.
Researcher Affiliation Academia José Blanchet EMAIL Department of Management Science and Engineering, Stanford University, Stanford, CA 94305, USA; Yang Kang EMAIL Department of Statistics, Columbia University, New York, NY 10027, USA; José Luis Montiel Olea EMAIL Department of Economics, Cornell University, Ithaca, NY 14850, USA; Viet Anh Nguyen EMAIL Department of Systems Engineering and Engineering Management, Chinese University of Hong Kong, Hong Kong SAR; Xuhui Zhang EMAIL Department of Management Science and Engineering, Stanford University, Stanford, CA 94305, USA
Pseudocode Yes 6.5 Algorithm for the Unbiased Multilevel Monte Carlo We present a parallelized version using L processors, which works even when L = 1. Parallel computing reduces the variance of the estimator, so we suggest using as many processors as are available per run. Fix an integer m0 ∈ N such that 2^(m0+1) ≥ 2d. For each processor l = 1, . . . , L we consider the following steps. i) Take a random (integer) draw, m_l, from a geometric distribution with parameter [...]. ii) Given m_l, take 2^(K_l+1) i.i.d. draws of the d-dimensional vector ξ_i ~ Q_1 ⊗ · · · ⊗ Q_d, where K_l = m0 + m_l.
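Steps i)–ii) of the quoted pseudocode can be sketched as follows. This is an illustration only, not the authors' code: the geometric parameter and burn-in level (r = 0.6, m0 = 5) are taken from the experiment settings quoted below in this report, and Bernoulli(0.5) dropout is an assumed stand-in for the product distribution Q_1 ⊗ · · · ⊗ Q_d.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 100   # covariate dimension from the simulation setting
m0 = 5    # burn-in level (chosen so that 2^(m0+1) >= 2d is plausible)
r = 0.6   # geometric rate reported in the experiments

# i) random level for processor l: m_l ~ Geometric(r), shifted to {0, 1, ...}
m_l = rng.geometric(r) - 1

# ii) given m_l, take 2^(K_l+1) i.i.d. draws of the d-dimensional dropout
#     vector xi; Bernoulli(0.5) here is an assumed example of Q_1 x ... x Q_d
K_l = m0 + m_l
xi = rng.binomial(1, 0.5, size=(2 ** (K_l + 1), d))
```

The randomized level K_l is what makes the estimator unbiased across levels; each processor repeats this draw independently.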
Open Source Code No The paper discusses various algorithms and their implementations but does not provide any explicit statement about releasing its source code or a link to a code repository.
Open Datasets No Our simulation setting considers a linear regression model with a covariate vector having dimensionality d = 100 and sample size n = 50. We pick a known regression coefficient β0 ∈ R^d being a vector with all entries equal to 1. With the coefficients fixed, we assume the covariate vector follows an independent Gaussian distribution, and likewise for the regression noise. More specifically, we obtain our 50 observations (xi, yi) by sampling xi ~ N(0, I_d), i = 1, . . . , n, and sampling yi ∈ R conditional on xi, where yi is given by the linear assumption and the εi are i.i.d. random noise following N(0, 10^2), for i = 1, . . . , n.
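The simulated dataset described in this excerpt can be generated in a few lines. A minimal sketch (not the authors' code), using the stated values d = 100, n = 50, β0 = (1, …, 1), and noise standard deviation 10:

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 100, 50
beta0 = np.ones(d)                      # true coefficient: all entries equal to 1

X = rng.standard_normal((n, d))         # x_i ~ N(0, I_d)
eps = 10.0 * rng.standard_normal(n)     # eps_i ~ N(0, 10^2)
y = X @ beta0 + eps                     # linear model: y_i = x_i' beta0 + eps_i
```

Since the data are programmatically generated, there is no pre-existing dataset to release, consistent with the "No" result for this variable.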
Dataset Splits No The paper describes a simulation setting where data is programmatically generated for numerical experiments (Section 7.1). It does not involve pre-existing datasets that require explicit training/test/validation splits.
Hardware Specification Yes We run our simulation on a cluster with two Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40 GHz processors (with 10 cores each), and a total memory of 128 GB.
Software Dependencies No The paper describes algorithms and their conceptual implementation (e.g., stochastic gradient descent, multi-level Monte Carlo) but does not specify any particular programming languages, libraries, or frameworks with version numbers used for its experimental setup.
Experiment Setup Yes Standard SGD algorithm with a learning rate of 0.0001 and initialization at the origin. [...] Multi-level Monte Carlo algorithm with a geometric rate r = 0.6 and a burn-in period m0 = 5. Note that in each parallel run, we use gradient descent (GD) with a 0.01 learning rate and initialization at the origin for steps iii) and iv) in Section 6.4.
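The GD configuration quoted above (learning rate 0.01, initialization at the origin) can be illustrated on the simulated regression problem. A minimal sketch, not the authors' implementation: plain gradient descent on the least-squares loss, with the inner-step details of Section 6.4 omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data matching the paper's setting (d = 100, n = 50, noise std 10)
n, d = 50, 100
X = rng.standard_normal((n, d))
y = X @ np.ones(d) + 10.0 * rng.standard_normal(n)

beta = np.zeros(d)   # initialization at the origin
lr = 0.01            # learning rate from the quoted setup
for _ in range(1000):
    grad = X.T @ (X @ beta - y) / n   # gradient of (1/2n) * ||X beta - y||^2
    beta -= lr * grad
```

The reported SGD baseline uses the smaller rate 0.0001; the paper's Appendix A.7 is cited as evidence that the results are robust to such choices.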