Distributed Bootstrap for Simultaneous Inference Under High Dimensionality
Authors: Yang Yu, Shih-Kang Chao, Guang Cheng
JMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test our theory by extensive simulation studies, and a variable screening task on a semi-synthetic dataset based on the US Airline On-Time Performance dataset. The code to reproduce the numerical results is available in Supplementary Material. |
| Researcher Affiliation | Academia | Yang Yu EMAIL Department of Statistics Purdue University West Lafayette, IN 47907, USA; Shih-Kang Chao EMAIL Department of Statistics University of Missouri Columbia, MO 65211, USA; Guang Cheng EMAIL Department of Statistics University of California, Los Angeles Los Angeles, CA 90095, USA |
| Pseudocode | Yes | Algorithm 1: k-grad/n+k−1-grad with de-biased ℓ1-CSL estimator; Algorithm 2: Distributed K-fold cross-validation for t-step CSL; Algorithm 3: DistBoots(method, θ̃, {g_j}_{j=1}^k, Θ̃); Algorithm 4: Node(M̂); Algorithm 5: Simultaneous inference for distributed data with heteroscedasticity |
| Open Source Code | Yes | The code to reproduce the numerical results is available in Supplementary Material. |
| Open Datasets | Yes | The US Airline On-Time Performance dataset (DVN, 2008), available at http://stat-computing.org/dataexpo/2009 |
| Dataset Splits | Yes | We randomly sample a dataset D1 of N = 500,000 observations, and conceptually distribute them across k = 1,000 nodes such that each node receives n = 500 observations. We randomly sample another dataset D2 of N = 500,000 observations for a pilot study to select relevant variables, where D1 ∩ D2 = ∅. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies, libraries, or frameworks used in the experiments. |
| Experiment Setup | Yes | We consider a Gaussian linear model and a logistic regression model. We fix total sample size N = 2^14 and the dimension d = 2^10, and choose the number of machines k from {2^2, 2^3, . . . , 2^6}. The true coefficient θ is a d-dimensional vector in which the first s_0 coordinates are 1 and the rest is 0, where s_0 ∈ {2^2, 2^4} for the linear model and s_0 ∈ {2^1, 2^3} for the GLM. ... For the ℓ1-CSL computation, we choose the initial λ^(0) by a local K-fold cross-validation, where K = 10 for linear regression and K = 5 for logistic regression. For each iteration t, λ^(t) is selected by Algorithm 2 in Section 2.4 with K folds with K = min{k − 1, 5} ... At each replication, we draw B = 500 bootstrap samples, from which we calculate the 95% empirical quantile to further obtain the 95% simultaneous confidence interval. |
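The experiment setup quoted above (multiplier bootstrap over per-machine gradients, 95% empirical quantile of a max-type statistic, simultaneous confidence interval) can be sketched in a scaled-down toy form. This is a hedged illustration, not the paper's Algorithm 1: it uses ordinary least squares as a stand-in for the de-biased ℓ1-CSL estimator, a simplified "k-grad"-style multiplier scheme with one Gaussian multiplier per machine, and hypothetical scaling constants; the sample sizes are shrunk from the paper's N = 2^14, d = 2^10.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (scaled down from the paper's N = 2^14, d = 2^10)
N, d, k = 2**10, 2**4, 2**3
n = N // k                                # observations per machine

# True coefficient: first s_0 coordinates are 1, the rest 0 (as in the paper)
theta_star = np.zeros(d)
theta_star[:2] = 1.0

# Gaussian linear model
X = rng.standard_normal((N, d))
y = X @ theta_star + rng.standard_normal(N)

# Stand-in for the distributed CSL estimator: pooled least squares
theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

# Per-machine average gradients of the squared loss at theta_hat
resid = y - X @ theta_hat
g = np.array([-(X[j*n:(j+1)*n].T @ resid[j*n:(j+1)*n]) / n for j in range(k)])

# Multiplier bootstrap: one Gaussian multiplier per machine,
# max-coordinate statistic (scaling here is an illustrative assumption)
B = 500
boot_max = np.empty(B)
for b in range(B):
    eps = rng.standard_normal(k)
    boot_max[b] = np.abs(np.sqrt(n / k) * eps @ (g - g.mean(0))).max()

# 95% empirical quantile -> half-width of a simultaneous confidence interval
c95 = np.quantile(boot_max, 0.95)
half_width = c95 / np.sqrt(N)
ci = np.stack([theta_hat - half_width, theta_hat + half_width], axis=1)
```

The simultaneous interval covers every coordinate at once because a single quantile of the max statistic sets one common half-width, rather than a per-coordinate critical value.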