Distributed Bootstrap for Simultaneous Inference Under High Dimensionality

Authors: Yang Yu, Shih-Kang Chao, Guang Cheng

JMLR 2022

Reproducibility Variable Result LLM Response
Research Type Experimental We test our theory by extensive simulation studies, and a variable screening task on a semi-synthetic dataset based on the US Airline On-Time Performance dataset. The code to reproduce the numerical results is available in Supplementary Material.
Researcher Affiliation Academia Yang Yu EMAIL Department of Statistics Purdue University West Lafayette, IN 47907, USA; Shih-Kang Chao EMAIL Department of Statistics University of Missouri Columbia, MO 65211, USA; Guang Cheng EMAIL Department of Statistics University of California, Los Angeles Los Angeles, CA 90095, USA
Pseudocode Yes Algorithm 1 k-grad/n+k-1-grad with the de-biased ℓ1-CSL estimator; Algorithm 2 Distributed K-fold cross-validation for the t-step CSL; Algorithm 3 DistBoots(method, θ̃, {g_j}_{j=1}^k, Θ̃); Algorithm 4 Node(M̂); Algorithm 5 Simultaneous inference for distributed data with heteroscedasticity
Open Source Code Yes The code to reproduce the numerical results is available in Supplementary Material.
Open Datasets Yes The US Airline On-Time Performance dataset (DVN, 2008), available at http://stat-computing.org/dataexpo/2009
Dataset Splits Yes We randomly sample a dataset D1 of N = 500,000 observations, and conceptually distribute them across k = 1,000 nodes such that each node receives n = 500 observations. We randomly sample another dataset D2 of N = 500,000 observations for a pilot study to select relevant variables, where D1 ∩ D2 = ∅.
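The split described in the row above can be sketched as follows. This is a hypothetical illustration, not the authors' code: `total_rows` stands in for the full Airline On-Time table, and the helper name `sample_and_distribute` is invented for this sketch. It draws two disjoint index sets D1 and D2 of N = 500,000 rows each, then shards D1 across k = 1,000 nodes of n = 500 rows apiece.

```python
import numpy as np

def sample_and_distribute(total_rows, N=500_000, k=1_000, seed=0):
    """Sample disjoint D1, D2 and shard D1 evenly across k nodes."""
    rng = np.random.default_rng(seed)
    # Draw 2N distinct row indices, then split them into D1 and D2,
    # so D1 and D2 are disjoint by construction.
    idx = rng.choice(total_rows, size=2 * N, replace=False)
    d1_idx, d2_idx = idx[:N], idx[N:]
    # Each node receives n = N / k = 500 observations.
    node_shards = np.array_split(d1_idx, k)
    return d1_idx, d2_idx, node_shards

d1, d2, shards = sample_and_distribute(total_rows=2_000_000)
```

Sampling the 2N indices in one draw without replacement is what guarantees D1 ∩ D2 = ∅, matching the pilot-study requirement quoted above.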
Hardware Specification No The paper does not provide specific details about the hardware used for running the experiments, such as GPU or CPU models.
Software Dependencies No The paper does not provide specific version numbers for any software dependencies, libraries, or frameworks used in the experiments.
Experiment Setup Yes We consider a Gaussian linear model and a logistic regression model. We fix the total sample size N = 2^14 and the dimension d = 2^10, and choose the number of machines k from {2^2, 2^3, . . . , 2^6}. The true coefficient θ is a d-dimensional vector in which the first s0 coordinates are 1 and the rest are 0, where s0 ∈ {2^2, 2^4} for the linear model and s0 ∈ {2^1, 2^3} for the GLM. ... For the ℓ1-CSL computation, we choose the initial λ^(0) by a local K-fold cross-validation, where K = 10 for linear regression and K = 5 for logistic regression. For each iteration t, λ^(t) is selected by Algorithm 2 in Section 2.4 with K folds, where K = min{k − 1, 5} ... At each replication, we draw B = 500 bootstrap samples, from which we calculate the 95% empirical quantile to further obtain the 95% simultaneous confidence interval.
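The simulation design quoted above can be sketched in a few lines. This is a hedged illustration under stated assumptions, not the paper's implementation: the Gaussian max-statistics below are placeholders for the k-grad / n+k-1-grad bootstrap statistics of Algorithm 1, and the stand-in estimator `theta_hat` is invented for the sketch. It shows the sparse truth (first s0 coordinates equal to 1), B = 500 bootstrap draws, and the 95% empirical quantile of the sup-statistic turned into a simultaneous confidence band.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, s0, B = 2**14, 2**10, 2**2, 500   # sample size, dimension, sparsity, bootstrap draws

# Sparse truth: first s0 coordinates are 1, the rest are 0.
theta_true = np.zeros(d)
theta_true[:s0] = 1.0

# Stand-in for the de-biased estimator (hypothetical, for illustration only).
theta_hat = theta_true + rng.normal(scale=1 / np.sqrt(N), size=d)

# Placeholder bootstrap sup-norm statistics; the paper's Algorithm 1
# would compute these from local gradients instead.
boot_max = np.abs(rng.normal(size=(B, d))).max(axis=1)
c95 = np.quantile(boot_max, 0.95)       # 95% empirical quantile

# Simultaneous 95% confidence interval: every coordinate gets the same
# half-width, calibrated by the sup-statistic quantile.
half_width = c95 / np.sqrt(N)
lower, upper = theta_hat - half_width, theta_hat + half_width
```

The key design point is that one quantile of the *maximum* statistic calibrates all d intervals at once, which is what makes the resulting band simultaneous rather than pointwise.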