Asymptotic Analysis of Conditioned Stochastic Gradient Descent
Authors: Rémi Leluc, François Portier
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | For the sake of completeness and illustrative purposes, we compare the performance of classical stochastic gradient descent (SGD) and the conditioned variant (CSGD) presented in Appendix B, where the matrix Φ_k is an average of past Hessian estimates as given in Equation (22). We compare equal weights ω_{j,k} = (k+1)^{−1} and adaptive weights ω_{j,k} ∝ exp(−η‖θ_j − θ_k‖) with η > 0, which give more importance to Hessian estimates associated with iterates that are close to the current point. Furthermore, for computational reasons, we consider a novel adaptive stochastic first-order method which is a variant of AdaGrad. Starting from the null vector θ_0 = (0, …, 0) ∈ ℝ^d, we use an optimal learning rate of the form γ_k = α/(k + k_0) (Bottou et al., 2018) and set λ_k^(m) ≡ 0, λ_k^(M) = Λk in the experiments, where α, k_0 and Λ are tuned using a grid search. The means of the optimality ratio k ↦ [F(θ_k) − F(θ*)]/[F(θ_0) − F(θ*)], obtained over 100 independent runs, are presented in the figures below. |
| Researcher Affiliation | Academia | Rémi Leluc EMAIL CMAP, École Polytechnique, Institut Polytechnique de Paris, Palaiseau (France); François Portier EMAIL CREST, ENSAI, École Nationale de la Statistique et de l'Analyse de l'Information, Rennes (France) |
| Pseudocode | No | The paper describes algorithms using mathematical equations and textual descriptions, such as "θ_{k+1} = θ_k − γ_{k+1} C_k g(θ_k, ξ_{k+1}), k ≥ 0". However, it does not include a distinct block explicitly labeled as "Pseudocode" or "Algorithm" with structured steps. |
| Open Source Code | No | The paper mentions "implemented in widely used programming tools (Pedregosa et al., 2011; Abadi et al., 2016)", referring to third-party software like scikit-learn and TensorFlow. However, it does not provide any explicit statement about releasing its own source code for the methodology described, nor does it include a link to a code repository. |
| Open Datasets | Yes | Real-world data. We now turn our attention to real-world data and consider again the Ridge regression problem on the following datasets: Boston Housing dataset (Harrison Jr & Rubinfeld, 1978) (n = 506; d = 14) and Diabetes dataset (Dua & Graff, 2017) (n = 442; d = 10). |
| Dataset Splits | No | The paper mentions using "simulated data" and "real-world data" (Boston Housing, Diabetes datasets) and states, "We use a batch-size equal to |B| = 16." This specifies the mini-batch size but does not provide information regarding how the datasets were split into training, validation, or test sets for reproduction. |
| Hardware Specification | No | The paper does not explicitly describe any specific hardware (e.g., GPU models, CPU types, memory specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions "widely used programming tools (Pedregosa et al., 2011; Abadi et al., 2016)" such as scikit-learn and TensorFlow in a general context. However, it does not provide a list of specific software dependencies, libraries, or frameworks with their version numbers that were used for the authors' implementation. |
| Experiment Setup | Yes | Starting from the null vector θ_0 = (0, …, 0) ∈ ℝ^d, we use an optimal learning rate of the form γ_k = α/(k + k_0) (Bottou et al., 2018) and set λ_k^(m) ≡ 0, λ_k^(M) = Λk in the experiments, where α, k_0 and Λ are tuned using a grid search. The means of the optimality ratio k ↦ [F(θ_k) − F(θ*)]/[F(θ_0) − F(θ*)], obtained over 100 independent runs, are presented in the figures below. We use a batch-size equal to |B| = 16. |
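The conditioned update quoted above (θ_{k+1} = θ_k − γ_{k+1} C_k g(θ_k, ξ_{k+1}), with Φ_k an equal-weight average of past Hessian estimates and eigenvalues clipped to [λ_k^(m), λ_k^(M)]) can be sketched as follows. This is a minimal illustration, not the authors' released code: the ridge-regression objective, the function name `conditioned_sgd`, and the default constants are assumptions chosen for the example.

```python
import numpy as np

def conditioned_sgd(A, b, lam=0.1, alpha=1.0, k0=10.0, Lam=100.0,
                    batch=16, steps=500, seed=0):
    """Sketch of CSGD: theta_{k+1} = theta_k - gamma_{k+1} C_k g(theta_k, xi_{k+1}).

    Phi_k averages past mini-batch Hessian estimates with equal weights
    omega_{j,k} = (k+1)^{-1}; C_k inverts Phi_k after clipping its spectrum
    (here to [1e-8, Lam*k], a stand-in for [lambda_k^(m), lambda_k^(M)]).
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    theta = np.zeros(d)        # theta_0 = (0, ..., 0)
    phi = np.zeros((d, d))     # running average of Hessian estimates
    for k in range(steps):
        idx = rng.choice(n, size=batch, replace=False)
        Ab, bb = A[idx], b[idx]
        # ridge objective F(theta) = ||A theta - b||^2/(2n) + lam ||theta||^2/2
        g = Ab.T @ (Ab @ theta - bb) / batch + lam * theta   # gradient estimate
        H = Ab.T @ Ab / batch + lam * np.eye(d)              # Hessian estimate
        phi = (k * phi + H) / (k + 1)                        # equal-weight average
        w, V = np.linalg.eigh(phi)                           # clip the spectrum
        w = np.clip(w, 1e-8, Lam * (k + 1))
        C = V @ np.diag(1.0 / w) @ V.T                       # C_k ~ Phi_k^{-1}
        gamma = alpha / (k + 1 + k0)                         # gamma_k = alpha/(k + k0)
        theta = theta - gamma * C @ g
    return theta
```

The optimality ratio reported in the paper can then be computed against the closed-form ridge solution, e.g. `(F(theta_k) - F(theta_star)) / (F(theta_0) - F(theta_star))`, averaged over independent runs.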