WONDER: Weighted One-shot Distributed Ridge Regression in High Dimensions
Authors: Edgar Dobriban, Yue Sheng
JMLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test WONDER in simulation studies and using the Million Song Dataset as an example. There it can save at least 100x in computation time, while nearly preserving test accuracy. Keywords: distributed learning, ridge regression, high-dimensional statistics, random matrix theory... We provide numerical simulations throughout the paper, and additional ones in Section 6, along with an example using an empirical data set. |
| Researcher Affiliation | Academia | Edgar Dobriban EMAIL Wharton Statistics Department University of Pennsylvania Philadelphia, PA 19104, USA; Yue Sheng EMAIL Graduate Group in Applied Mathematics and Computational Science University of Pennsylvania Philadelphia, PA 19104, USA |
| Pseudocode | Yes | Algorithm 1: WONDER: Weighted ONe-shot DistributEd Ridge regression algorithm, general design... Algorithm 2: WONDER: Weighted ONe-shot DistributEd Ridge regression algorithm, isotropic design |
| Open Source Code | Yes | The code for our paper is available at github.com/dobriban/dist_ridge. |
| Open Datasets | Yes | We test WONDER in simulation studies and using the Million Song Dataset as an example.... Figure 10: Million Song Year Prediction Dataset (MSD). Optimal weighted average (WONDER), Naive average, and regression on 1/k fraction of data.... Specifically, we perform the following steps in our data analysis. We download the data set from the UC Irvine Machine Learning Repository. The original data set has N = 515,345 samples and p = 91 features. |
| Dataset Splits | Yes | The data set has already been divided into a training set and a test set. The training set consists of the first 463,715 samples and the test set contains the rest. We attempt to predict the release year of a song. Before doing distributed regression, we first center and normalize both the design matrix X and the outcome Y. Now we are ready to do ridge regression under the distributed setting. For each experiment, we randomly choose n_train = 10,000 samples from the training set and n_test = 1,000 samples from the test set. |
| Hardware Specification | No | The paper does not provide specific hardware details used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | For each group, we choose the same tuning parameter λ_i = p/(n_i α²). For the global regression on the entire data set, we choose the tuning parameter λ = p/(n α²) optimally.... We set all local regularization parameters to equal values, which is reasonable, since the local problems are exchangeable. We also parametrize the regularization parameters as multiples of the optimal parameter for the isotropic case (which equals kγ/α²).... We try different tuning parameters λ around kp/(n_train α̂²), and use λ = 3kp/(n_train α̂²) as our final parameter. (In practice, one may try a 1-D grid search to find the right scale.) |
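The setup rows above can be sketched in a few lines of numpy: split the data across k machines, fit a local ridge estimator on each with λ_i = p/(n_i α²), average the local estimators, and compare against the global ridge fit with λ = p/(n α²). This is a minimal illustrative sketch, not the authors' released code (github.com/dobriban/dist_ridge); it uses a naive unweighted average where WONDER would use the optimal weights, and the simulation sizes and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k, alpha2 = 2000, 100, 4, 1.0  # samples, features, machines, signal strength

# Simulate an isotropic-design ridge model: Y = X beta + noise,
# with E||beta||^2 = alpha2, matching the paper's parametrization.
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p) * np.sqrt(alpha2 / p)
Y = X @ beta + rng.standard_normal(n)

def ridge(X, Y, lam):
    """Ridge estimator: solve (X'X + lam * I) b = X'Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# Local fits: machine j sees n_i = n/k samples and uses lambda_i = p / (n_i * alpha2).
n_i = n // k
local = [
    ridge(X[j * n_i:(j + 1) * n_i], Y[j * n_i:(j + 1) * n_i], p / (n_i * alpha2))
    for j in range(k)
]

# Naive one-shot average of the local estimators.
# WONDER would instead combine them with optimally chosen weights.
beta_avg = np.mean(local, axis=0)

# Global ridge on the full data with lambda = p / (n * alpha2), for comparison.
beta_glob = ridge(X, Y, p / (n * alpha2))

print(np.linalg.norm(beta_avg - beta), np.linalg.norm(beta_glob - beta))
```

In practice, as the setup row notes, one would grid-search the scale of λ around kp/(n_train α̂²) rather than fixing it from a known α².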