Scalable Resampling in Massive Generalized Linear Models via Subsampled Residual Bootstrap
Authors: Indrila Ganguly, Srijan Sengupta, Sujit Ghosh
JMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the empirical performance of SRB via simulation studies and a real data analysis of the Forest Covertype data from the UCI Machine Learning Repository. |
| Researcher Affiliation | Academia | Indrila Ganguly, Biostatistics, Bioinformatics and Epidemiology Program, Fred Hutchinson Cancer Center, Seattle, WA 98109, USA; Srijan Sengupta, Department of Statistics, North Carolina State University, Raleigh, NC 27695-7103, USA; Sujit Ghosh, Department of Statistics, North Carolina State University, Raleigh, NC 27695-7103, USA |
| Pseudocode | Yes | Figure 1: Comparison of Residual Bootstrap and Subsampled Residual Bootstrap methods for GLMs |
| Open Source Code | No | The paper does not provide explicit statements or links to open-source code for the methodology described. |
| Open Datasets | Yes | We used the proposed SRB method to analyze the Forest Cover type data obtained from UCI Machine Learning Repository (Blackard, 1998). |
| Dataset Splits | No | The paper mentions using a subset of the data for real data analysis (n = 495,141 observations) and generating data for simulations, but it does not specify any training/test/validation dataset splits for reproducibility. |
| Hardware Specification | No | The paper does not provide any specific hardware details used for running the experiments. |
| Software Dependencies | No | The paper states that logistic and Poisson regression were fit with the glm() function from the stats package in R, using the default starting values for the iteratively re-weighted least squares procedure, but it does not specify R or package versions. |
| Experiment Setup | Yes | For each GLM setting, we generated M = 48 data sets and carried out B = 25 iterations of SRB and RB for each data set. This choice of M and B ensures that the standard error of the average error rate is below 0.01 (see the Appendix for a proof). Each iteration of SRB or RB involves R = 100 resamples. For SRB, we take b = n^γ with γ ∈ {0.5, 0.6, 0.7, 0.8, 0.9}. |
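The experiment-setup row describes the core SRB recipe: fit the model once on the full sample, then build each of the R bootstrap replicates from a subsample of size b = n^γ with resampled residuals. The sketch below is a simplified numpy illustration of that scheme for a plain linear model (the paper treats GLMs fit via R's glm(); the data, γ, and R values here are illustrative only, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (hypothetical, for illustration only)
n, p = 10_000, 3
X = rng.normal(size=(n, p))
beta = np.array([1.0, -2.0, 0.5])
y = X @ beta + rng.normal(size=n)

# Full-sample fit and residuals (least squares stands in for the GLM fit)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

# Subsampled residual bootstrap: each replicate refits on a
# subsample of size b = n^gamma, with residuals resampled
# (with replacement) from the full-sample residuals
gamma, R = 0.7, 100
b = int(n ** gamma)
boot = np.empty((R, p))
for r in range(R):
    idx = rng.choice(n, size=b, replace=False)        # subsample rows
    e_star = rng.choice(resid, size=b, replace=True)  # resample residuals
    y_star = X[idx] @ beta_hat + e_star               # reconstruct responses
    boot[r], *_ = np.linalg.lstsq(X[idx], y_star, rcond=None)

# Bootstrap standard errors from the replicate estimates
se = boot.std(axis=0)
```

Because each replicate refits on only b = n^γ ≪ n rows, the per-resample cost shrinks accordingly, which is the computational gain the subsampled variant targets over the full residual bootstrap.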