LinCDE: Conditional Density Estimation via Lindsey's Method

Authors: Zijun Gao, Trevor Hastie

JMLR 2022

Reproducibility Variable Result LLM Response
Research Type Experimental We demonstrate LinCDE's efficacy through extensive simulations and three real data examples.
Researcher Affiliation Academia Zijun Gao (EMAIL), Department of Statistics, Stanford University, Stanford, CA 94305, USA; Trevor Hastie (EMAIL), Department of Statistics and Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA
Pseudocode Yes Algorithm 1: LinCDE tree. Algorithm 2: LinCDE boosting.
Open Source Code Yes Software for LinCDE is made available as an R package at https://github.com/ZijunGao/LinCDE. The package can be installed from GitHub with install.packages("devtools"); devtools::install_github("ZijunGao/LinCDE", build_vignettes = TRUE)
Open Datasets Yes The Old Faithful Geyser data records the eruptions from the Old Faithful geyser in Yellowstone National Park (Azzalini and Bowman, 1990). The human height data is taken from the NHANES data set: a series of health and nutrition surveys collected by the US National Center for Health Statistics (NCHS). The air pollution data (Wu and Dominici, 2020) focuses on PM2.5 exposures in the United States.
Dataset Splits Yes The training data set consists of 1000 i.i.d. samples. The performance is evaluated on an independent test data set of size 1000. For real data analysis: first, we split the samples into training and test data sets; next, we perform 5-fold cross-validation on the training data set to select the hyper-parameters; finally, we apply the estimators with the selected hyper-parameters and evaluate multiple criteria on the test data set. For the Human Height Data (larger subset): we split the data set into validation, training, and test sets (proportion 2:1:1). For the Air Pollution Data: we split the data into training, validation, and test sets (proportion 2:1:1), and tune on the hold-out validation data.
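The splitting protocol described above can be sketched as follows. This is a minimal Python illustration of a 2:1:1 split followed by 5-fold cross-validation on the training portion; the function names are ours, not from the LinCDE package.

```python
import random

def split_2_1_1(n, seed=0):
    # Shuffle indices and split in a 2:1:1 proportion
    # (half training, a quarter validation, the rest test).
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train, n_val = n // 2, n // 4
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

def five_folds(train_idx):
    # Assign training indices to 5 cross-validation folds round-robin.
    return [train_idx[k::5] for k in range(5)]

train, val, test_idx = split_2_1_1(2000)
folds = five_folds(train)
```

Hyper-parameters would then be selected by averaging a criterion (e.g., held-out log-likelihood) over the five folds before the final fit on the full training set.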
Hardware Specification Yes The experiments are run on a personal computer with a dual-core CPU and 8GB memory.
Software Dependencies Yes Software for LinCDE is made available as an R package at https://github.com/ZijunGao/LinCDE. Quantile regression forest: R package quantregForest (Meinshausen, 2017). Distribution boosting: R package conTree (Friedman and Narasimhan, 2020).
Experiment Setup Yes By default, we use k = 10 transformed natural cubic splines and a Gaussian carrying density. We use a small learning rate η = 0.01 to avoid overfitting. We use 40 discretization bins for training, and 20 or 50 for testing. The primary parameter is the number of trees (iteration number). Secondary tuning parameters include the tree size, the learning rate, and the ridge penalty parameter. On a separate validation data set, we experimented with a grid of secondary parameters, each associated with a sequence of iteration numbers, and selected the best-performing configuration.
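The tuning procedure above (a grid of secondary parameters, each paired with a sequence of iteration numbers, scored on a validation set) can be sketched as follows. The grid values and the `validation_loss` stand-in are hypothetical placeholders; in practice one would fit LinCDE boosting at each configuration and evaluate the negative log-likelihood on the validation data.

```python
import itertools

# Hypothetical grids of secondary parameters (tree size, learning rate,
# ridge penalty) and a sequence of iteration numbers for each.
depths = [2, 3, 4]
learning_rates = [0.01, 0.05]
ridge_penalties = [0.0, 0.1]
n_iterations = range(50, 501, 50)

def validation_loss(depth, lr, ridge, n_iter):
    # Placeholder stand-in for: fit the booster with these settings and
    # score negative log-likelihood on the held-out validation set.
    return abs(n_iter * lr - 3.0) + 0.1 * depth + ridge

# Select the best-performing configuration across the full grid.
best = min(
    ((d, lr, r, m)
     for d, lr, r in itertools.product(depths, learning_rates, ridge_penalties)
     for m in n_iterations),
    key=lambda cfg: validation_loss(*cfg),
)
```

The design mirrors early stopping in gradient boosting: the iteration number is the primary parameter swept densely, while each secondary configuration contributes one curve over iterations.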