Sketched Ridge Regression: Optimization Perspective, Statistical Perspective, and Model Averaging

Authors: Shusen Wang, Alex Gittens, Michael W. Mahoney

JMLR 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical evaluations bear out these theoretical results. In particular, in Section 4, we show in Figure 3 that even when the regularization parameter γ is fine-tuned, the risks of classical and Hessian sketch are worse than that of the optimal solution by an order of magnitude. We conduct experiments on synthetic data to verify our theory. Sections 4 and 5 conduct experiments to verify our theories and demonstrate the efficacy of model averaging. We tested the prediction performance of sketched ridge regression by implementing classical sketch with model averaging in PySpark (Zaharia et al., 2010).
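The model-averaging scheme quoted above can be illustrated with a minimal NumPy sketch (a hypothetical illustration, not the authors' PySpark implementation; the function name and seeding are made up here): partition the rows uniformly at random into g groups, which amounts to uniform row selection with sketch size s = n/g, solve the ridge problem on each group, and average the g solutions.

```python
import numpy as np

def model_averaged_classical_sketch(X, y, gamma, g, seed=None):
    """Classical sketch with model averaging (illustrative only).

    Partition the n rows into g random groups (uniform row selection,
    sketch size s = n/g), solve ridge regression on each sketch, and
    return the average of the g solutions.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    perm = rng.permutation(n)
    solutions = []
    for rows in np.array_split(perm, g):
        Xs, ys = X[rows], y[rows]
        s = len(rows)
        # Solve min_w (1/s)||Xs w - ys||^2 + gamma ||w||^2 on this sketch.
        w = np.linalg.solve(Xs.T @ Xs / s + gamma * np.eye(d),
                            Xs.T @ ys / s)
        solutions.append(w)
    return np.mean(solutions, axis=0)
```

In a distributed setting the g per-sketch solves are independent, which is why the paper's PySpark implementation parallelizes naturally over the partitions.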
Researcher Affiliation | Academia | Shusen Wang (EMAIL), International Computer Science Institute and Department of Statistics, University of California at Berkeley, Berkeley, CA 94720, USA; Alex Gittens (EMAIL), Computer Science Department, Rensselaer Polytechnic Institute, Troy, NY 12180, USA; Michael W. Mahoney (EMAIL), International Computer Science Institute and Department of Statistics, University of California at Berkeley, Berkeley, CA 94720, USA
Pseudocode | No | The paper describes methods and theoretical analyses, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured, code-like steps for any procedure.
Open Source Code | Yes | The code is available at https://github.com/wangshusen/SketchedRidgeRegression.git
Open Datasets | Yes | We use the Million Song Year Prediction data set, which has 463,715 training samples and 51,630 test samples with 90 features and one response.
Dataset Splits | Yes | We use the Million Song Year Prediction data set, which has 463,715 training samples and 51,630 test samples with 90 features and one response. We randomly partition the training data into g parts, which amounts to uniform row selection with sketch size s = n/g.
Hardware Specification | No | The paper states, 'We ran our experiments using PySpark in local mode,' but it does not specify any hardware details such as CPU or GPU models, memory, or other machine specifications used for these experiments.
Software Dependencies | No | The paper mentions implementing code 'in PySpark (Zaharia et al., 2010)' and 'in Python', but it does not provide version numbers for these or any other software libraries or dependencies, which are necessary for full reproducibility.
Experiment Setup | Yes | In Figure 2, we plot the objective function value f(w) = (1/n)||Xw − y||_2^2 + γ||w||_2^2 against γ, under different settings of ξ (the standard deviation of the Gaussian noise added to the response). We calculate the bias and variance, bias(w*) and var(w*), of the optimal MRR solution according to Theorem 4. We consider different noise levels by setting ξ = 10^-2 or 10^-1. We use five-fold cross-validation to determine the regularization parameter γ.
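The objective function and the five-fold cross-validation step described above can be sketched in Python (a hedged illustration under assumed conventions: the function names and the grid of candidate γ values are illustrative, not taken from the paper):

```python
import numpy as np

def ridge_objective(X, y, w, gamma):
    # f(w) = (1/n)||Xw - y||_2^2 + gamma * ||w||_2^2
    n = X.shape[0]
    return np.sum((X @ w - y) ** 2) / n + gamma * np.sum(w ** 2)

def cv_select_gamma(X, y, gammas, k=5, seed=0):
    """Pick gamma by k-fold cross-validation on held-out squared error."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    folds = np.array_split(rng.permutation(n), k)
    best_gamma, best_err = None, np.inf
    for gamma in gammas:
        err = 0.0
        for i in range(k):
            test = folds[i]
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            Xt, yt = X[train], y[train]
            # Closed-form ridge solution on the training folds.
            w = np.linalg.solve(Xt.T @ Xt / len(train) + gamma * np.eye(d),
                                Xt.T @ yt / len(train))
            err += np.sum((X[test] @ w - y[test]) ** 2)
        if err < best_err:
            best_gamma, best_err = gamma, err
    return best_gamma
```

The same routine would be applied once per noise level ξ, since the best-performing γ generally grows with the noise added to the response.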