LinCDE: Conditional Density Estimation via Lindsey's Method

Authors: Zijun Gao, Trevor Hastie

JMLR 2022

Reproducibility Variable Result LLM Response
Research Type Experimental We demonstrate LinCDE's efficacy through extensive simulations and three real data examples.
Researcher Affiliation Academia Zijun Gao (EMAIL), Department of Statistics, Stanford University, Stanford, CA 94305, USA; Trevor Hastie (EMAIL), Department of Statistics and Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA
Pseudocode Yes Algorithm 1: LinCDE tree. Algorithm 2: LinCDE boosting.
Open Source Code Yes Software for LinCDE is made available as an R package at https://github.com/ZijunGao/LinCDE. The package can be installed from GitHub with install.packages("devtools"); devtools::install_github("ZijunGao/LinCDE", build_vignettes = TRUE)
Open Datasets Yes The Old Faithful Geyser data records the eruptions from the Old Faithful geyser in Yellowstone National Park (Azzalini and Bowman, 1990). The human height data is taken from the NHANES data set: a series of health and nutrition surveys collected by the US National Center for Health Statistics (NCHS). The air pollution data (Wu and Dominici, 2020) focuses on PM2.5 exposures in the United States.
Dataset Splits Yes The training data set consists of 1000 i.i.d. samples. The performance is evaluated on an independent test data set of size 1000. For real data analysis: first, we split the samples into training and test data sets; next, we perform 5-fold cross-validation on the training data set to select the hyper-parameters; finally, we apply the estimators with the selected hyper-parameters and evaluate multiple criteria on the test data set. For the Human Height Data (larger subset): we split the data set into validation, training, and test sets (proportion 2:1:1). For the Air Pollution Data: we split the data into training, validation, and test sets (proportion 2:1:1), and tune on the hold-out validation data.
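The splitting protocol described above can be sketched as follows. This is a minimal Python illustration of a 2:1:1 split followed by 5-fold cross-validation on the training portion; the function names are ours, not from the LinCDE package.

```python
import random

def split_2_1_1(n, seed=0):
    # Shuffle indices and split in a 2:1:1 proportion
    # (half training, a quarter validation, the rest test).
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train, n_val = n // 2, n // 4
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

def five_folds(train_idx):
    # Assign training indices to 5 cross-validation folds round-robin.
    return [train_idx[k::5] for k in range(5)]

train, val, test_idx = split_2_1_1(2000)
folds = five_folds(train)
```

Hyper-parameters would then be selected by averaging a criterion (e.g., held-out log-likelihood) over the five folds before the final fit on the full training set.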
Hardware Specification Yes The experiments are run on a personal computer with a dual-core CPU and 8GB memory.
Software Dependencies Yes Software for LinCDE is made available as an R package at https://github.com/ZijunGao/LinCDE. Quantile regression forest: R package quantregForest (Meinshausen, 2017). Distribution boosting: R package conTree (Friedman and Narasimhan, 2020).
Experiment Setup Yes By default, we use k = 10 transformed natural cubic splines and a Gaussian carrying density. We use a small learning rate η = 0.01 to avoid overfitting. We use 40 discretization bins for training, and 20 or 50 for testing. The primary parameter is the number of trees (iteration number). Secondary tuning parameters include the tree size, the learning rate, and the ridge penalty parameter. On a separate validation data set, we experimented with a grid of secondary parameters, each associated with a sequence of iteration numbers, and selected the best-performing configuration.
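The tuning procedure above (a grid of secondary parameters, each paired with a sequence of iteration numbers, scored on a validation set) can be sketched as follows. The grid values and the `validation_loss` stand-in are hypothetical placeholders; in practice one would fit LinCDE boosting at each configuration and evaluate the negative log-likelihood on the validation data.

```python
import itertools

# Hypothetical grids of secondary parameters (tree size, learning rate,
# ridge penalty) and a sequence of iteration numbers for each.
depths = [2, 3, 4]
learning_rates = [0.01, 0.05]
ridge_penalties = [0.0, 0.1]
n_iterations = range(50, 501, 50)

def validation_loss(depth, lr, ridge, n_iter):
    # Placeholder stand-in for: fit the booster with these settings and
    # score negative log-likelihood on the held-out validation set.
    return abs(n_iter * lr - 3.0) + 0.1 * depth + ridge

# Select the best-performing configuration across the full grid.
best = min(
    ((d, lr, r, m)
     for d, lr, r in itertools.product(depths, learning_rates, ridge_penalties)
     for m in n_iterations),
    key=lambda cfg: validation_loss(*cfg),
)
```

The design mirrors early stopping in gradient boosting: the iteration number is the primary parameter swept densely, while each secondary configuration contributes one curve over iterations.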