Asymptotic Analysis of Conditioned Stochastic Gradient Descent

Authors: Rémi Leluc, François Portier

TMLR 2023

Reproducibility Variable Result LLM Response
Research Type Experimental For the sake of completeness and illustrative purposes, we compare the performance of classical stochastic gradient descent (sgd) and the conditionned variant (csgd) presented in Appendix B where the matrix Φk is an averaging of past Hessian estimates as given in Equation (22). We shall compare equal weights ωj,k = (k + 1) 1 and adaptive weights ωj,k exp( η θj θk 1) with η > 0 to give more importance to Hessian estimates associated to iterates which are closed to the current point. Furthermore, for computational reason, we consider a novel adaptive stochastic first-order method which is a variant of Adagrad. Starting from the null vector θ0 = (0, . . . , 0) Rd, we use optimal learning rate of the form γk = α/(k + k0) (Bottou et al., 2018) and set λ(m) k 0, λ(M) k = Λk in the experiments where γ, k0 and Λ are tuned using a grid search. The means of the optimality ratio k 7 [F(θk) F(θ )]/[F(θ0) F(θ )], obtained over 100 independent runs, are presented in Figures below.
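The two weighting schemes quoted above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function names `equal_weights` and `adaptive_weights` are hypothetical, and it assumes the weights are simply normalized over the past iterates 0..k.

```python
import numpy as np

def equal_weights(k):
    """Uniform weights w_{j,k} = 1/(k+1) over Hessian estimates at steps 0..k."""
    return np.full(k + 1, 1.0 / (k + 1))

def adaptive_weights(thetas, k, eta=1.0):
    """Weights w_{j,k} proportional to exp(-eta * ||theta_j - theta_k||_1),
    favouring Hessian estimates whose iterates lie close to theta_k."""
    dists = np.abs(np.asarray(thetas[: k + 1]) - thetas[k]).sum(axis=1)
    w = np.exp(-eta * dists)
    return w / w.sum()
```

With η > 0, past iterates far (in ℓ1 distance) from the current point θ_k contribute exponentially less to the averaged Hessian estimate; as η → 0 the adaptive scheme recovers the equal-weights case.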
Researcher Affiliation Academia Rémi Leluc (EMAIL), CMAP, École Polytechnique, Institut Polytechnique de Paris, Palaiseau (France); François Portier (EMAIL), CREST, ENSAI, École Nationale de la Statistique et de l'Analyse de l'Information, Rennes (France)
Pseudocode No The paper describes algorithms using mathematical equations and textual descriptions, such as "θ_{k+1} = θ_k − γ_{k+1} C_k g(θ_k, ξ_{k+1}), k ≥ 0". However, it does not include a distinct block explicitly labeled as "Pseudocode" or "Algorithm" with structured steps.
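The quoted update rule is a one-liner in code. The sketch below is a minimal illustration under assumed names (`csgd_step` is not from the paper): with C_k equal to the identity it reduces to plain SGD, while a good conditioner rescales the stochastic gradient.

```python
import numpy as np

def csgd_step(theta, grad, C, gamma):
    """One conditioned step: theta_{k+1} = theta_k - gamma_{k+1} * C_k @ g_k.
    With C = I this is plain SGD."""
    return theta - gamma * (C @ grad)

# Toy check on F(theta) = 0.5 theta^T A theta: with the exact gradient and
# the ideal conditioner C = A^{-1}, a unit step lands on the optimum theta* = 0.
A = np.array([[2.0, 0.0], [0.0, 4.0]])
theta = np.array([1.0, -1.0])
theta_next = csgd_step(theta, A @ theta, np.linalg.inv(A), 1.0)
```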
Open Source Code No The paper mentions "implemented in widely used programming tools (Pedregosa et al., 2011; Abadi et al., 2016)", referring to third-party software like scikit-learn and TensorFlow. However, it does not provide any explicit statement about releasing its own source code for the methodology described, nor does it include a link to a code repository.
Open Datasets Yes Real-world data. We now turn our attention to real-world data and consider again the Ridge regression problem on the following datasets: Boston Housing dataset (Harrison Jr & Rubinfeld, 1978) (n = 506; d = 14) and Diabetes dataset (Dua & Graff, 2017) (n = 442; d = 10).
Dataset Splits No The paper mentions using "simulated data" and "real-world data" (Boston Housing, Diabetes datasets) and states, "We use a batch-size equal to |B| = 16." This specifies the mini-batch size but does not provide information regarding how the datasets were split into training, validation, or test sets for reproduction.
Hardware Specification No The paper does not explicitly describe any specific hardware (e.g., GPU models, CPU types, memory specifications) used for running its experiments.
Software Dependencies No The paper mentions "widely used programming tools (Pedregosa et al., 2011; Abadi et al., 2016)" such as scikit-learn and TensorFlow in a general context. However, it does not provide a list of specific software dependencies, libraries, or frameworks with their version numbers that were used for the authors' implementation.
Experiment Setup Yes Starting from the null vector θ_0 = (0, . . . , 0) ∈ R^d, we use an optimal learning rate of the form γ_k = α/(k + k_0) (Bottou et al., 2018) and set λ_k^{(m)} ≡ 0, λ_k^{(M)} = Λ_k in the experiments, where α, k_0 and Λ are tuned using a grid search. The means of the optimality ratio k ↦ [F(θ_k) − F(θ*)]/[F(θ_0) − F(θ*)], obtained over 100 independent runs, are presented in the figures below. We use a batch-size equal to |B| = 16.
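A reproduction of this protocol would look roughly like the sketch below: SGD with step size γ_k = α/(k + k_0) on a toy quadratic, tracking the optimality ratio averaged over independent runs. All constants and the quadratic objective here are illustrative assumptions, not the paper's actual problems or tuned hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.diag([1.0, 10.0])                 # toy ill-conditioned quadratic
b = np.array([2.0, -1.0])                # optimum theta* = b
F = lambda th: 0.5 * (th - b) @ A @ (th - b)

def one_run(alpha=0.5, k0=10, n_iter=200, noise=0.1):
    """One noisy-gradient run; returns the optimality-ratio trajectory
    [F(theta_k) - F(theta*)] / [F(theta_0) - F(theta*)]."""
    theta = np.zeros(2)                  # theta_0 = (0, ..., 0)
    f0, ratios = F(theta), []
    for k in range(n_iter):
        g = A @ (theta - b) + noise * rng.standard_normal(2)
        theta = theta - alpha / (k + k0) * g
        ratios.append(F(theta) / f0)
    return np.array(ratios)

# Mean trajectory over 100 independent runs, as in the reported figures.
mean_ratio = np.mean([one_run() for _ in range(100)], axis=0)
```

Plotting `mean_ratio` against the iteration index reproduces the kind of optimality-ratio curves the review quotes; mini-batching (|B| = 16) would replace the single noisy gradient by an average over a sampled batch.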