Generalized Sparse Additive Models
Authors: Asad Haris, Noah Simon, Ali Shojaie
JMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We complement our theoretical results with empirical studies comparing some existing methods within this framework. ... Keywords: Generalized Additive Models, Sparsity, Minimax, High-Dimensional, Penalized Regression ... 5. Simulation Study: In this section, to complement our theoretical results, we conduct a simulation study to study the finite sample performance of various GSAMs as a function of n. ... 6. Data Analysis; 6.1 Boston Housing Data |
| Researcher Affiliation | Academia | Asad Haris EMAIL, Department of Earth, Ocean and Atmospheric Sciences, University of British Columbia, 2020-2207 Main Mall, Vancouver, BC, Canada V6T 1Z4; Noah Simon EMAIL; Ali Shojaie EMAIL, Department of Biostatistics, University of Washington, Seattle, WA 98195-7232, USA |
| Pseudocode | Yes | Algorithm 1: General Proximal Gradient Algorithm for (3); Algorithm A.1: Block Coordinate Descent for Least Squares Loss |
| Open Source Code | Yes | The R package GSAM, available on https://github.com/asadharis/GSAM, implements the methods described in this paper. |
| Open Datasets | Yes | 6.1 Boston Housing Data: We use the methods of Section 5 to predict the value of owner-occupied homes in the suburbs of Boston using census data from 1970. ... As done in the data analysis by Ravikumar et al. (2009), we add 10 noise covariates uniformly generated on the unit interval and 10 additional noise covariates obtained by randomly permuting the original covariates. 6.2 Gene Expression Data: We used the Curated Microarray Database (CuMiDa) (Feltes et al., 2019): a repository of gene-expression data sets curated from the Gene Expression Omnibus (GEO). ... 1. Lung: ... accession number GSE19804. ... 2. Prostate: ... accession number GSE6919 U95B. ... 3. Breast: ... accession number GSE70947. ... 4. Oral cavity: ... accession number GSE42743. |
| Dataset Splits | Yes | Approximately 75% of the observations are used as the training set, and the mean square prediction error on the test set is reported. The final model is selected using 5-fold cross-validation with the 1-standard-error rule. Results are presented for 100 splits of the data into training and test sets. We split the data as follows: 60% as training, 20% as validation, and 20% as test data. |
| Hardware Specification | Yes | For 100 replications of the proximal problem on a quad-core Intel Core i7-10510U CPU @ 1.80GHz, the median run-time with n = 500 for P_st = P_sobolev was 693.20 µs. |
| Software Dependencies | No | The paper mentions using |
| Experiment Setup | Yes | We fit each method over a sequence of 50 λ values on the training set, and select the tuning parameter λ which minimizes the test error (‖ŷ − y_test‖_n²). For the estimated model f̂_λ, we report the mean square error (MSE; ‖f̂_λ − f₀‖_n²) as a function of n. All methods were fit for a sequence of λ values, using (λ_sp, λ_st) = (λ, λ²) for GSAMs. The λ value with the largest area under the curve (AUC) for the ROC curve on the validation set was selected, and the corresponding model was used to classify samples in the test set. |
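The paper's Algorithm 1 is a general proximal gradient method. As a minimal illustration of the technique (not the authors' GSAM implementation, which lives in the `GSAM` R package), the sketch below applies proximal gradient descent to an ℓ1-penalized least squares problem; the function names and the choice of penalty are our own simplifications:

```python
import numpy as np

def prox_l1(v, t):
    # Soft-thresholding: the proximal operator of t * ||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_gradient(X, y, lam, step=None, n_iter=200):
    """Minimize (1/2n)||y - Xb||^2 + lam * ||b||_1 by iterating
    a gradient step on the smooth loss followed by the penalty's prox."""
    n, p = X.shape
    if step is None:
        # 1/L, where L = sigma_max(X)^2 / n bounds the gradient's Lipschitz constant.
        step = n / (np.linalg.norm(X, 2) ** 2)
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n   # gradient of the smooth least-squares term
        b = prox_l1(b - step * grad, step * lam)
    return b
```

For a GSAM, the same loop applies: only the proximal operator changes (e.g. to one induced by a sparsity-plus-smoothness penalty such as the Sobolev penalty timed in the paper).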
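The dataset-splits row mentions selecting the final model by 5-fold cross-validation with the 1-standard-error rule. A minimal sketch of that rule, assuming per-λ cross-validation means and standard errors have already been computed (the function name is ours, not from the paper):

```python
import numpy as np

def one_se_rule(lambdas, cv_mean, cv_se):
    """Pick the largest lambda whose CV error is within one standard error
    of the minimum CV error; lambdas are assumed sorted in decreasing order,
    so the chosen model is the sparsest among those near-optimal."""
    best = np.argmin(cv_mean)
    threshold = cv_mean[best] + cv_se[best]
    eligible = np.where(cv_mean <= threshold)[0]
    return lambdas[eligible[0]]  # first eligible index = most penalized model
```

The rule trades a small increase in estimated error for a simpler, more penalized model, which is why it is a common default for sparse additive fits.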