Selective Inference with Distributed Data
Authors: Sifan Liu, Snigdha Panigrahi
JMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of our approach through simulations and an analysis of a medical data set on ICU admissions. |
| Researcher Affiliation | Academia | Sifan Liu EMAIL Department of Statistics Stanford University Stanford, CA 94305-4020, USA Snigdha Panigrahi EMAIL Department of Statistics University of Michigan Ann Arbor, MI 48109-1107, USA |
| Pseudocode | Yes | Algorithm 1: Communication of information, Algorithm 2: Approximate selective MLE-based inference, Algorithm 3: Multiple carving, Algorithm 4: General aggregation rules. |
| Open Source Code | Yes | Our code can be accessed from the GitHub repository https://github.com/snigdhagit/Distributed-Selectinf. |
| Open Datasets | Yes | We illustrate an application of our procedure on a real data set that is publicly available on MIT's GOSSIS database Raffa et al. (2022). This data set contains records on intensive care unit (ICU) admissions from 192 hospitals... The same problem appeared in the 2021 Women in Data Science Datathon. https://www.kaggle.com/competitions/widsdatathon2021/data. Accessed on Dec. 17, 2022. |
| Dataset Splits | Yes | The n observations are partitioned into K + 1 disjoint subsets D(0), D(1), ..., D(K). Subsets 1 through K, representing the data stored at K local machines, are used for variable selection. Subset 0, representing the data at the central machine, is used only at the time of selective inference. The three data sets used for variable selection have sample sizes ranging from 1633 to 1788, and the data set reserved for inference has 2000 samples. |
| Hardware Specification | No | No specific hardware details (like CPU/GPU models, memory, or cloud instances) are provided in the paper. The paper mentions "running our simulations" and "average run time" but lacks specific hardware specifications. |
| Software Dependencies | No | The paper mentions that 'The original code is written in R, and we load them into Python when running our simulations', but does not specify version numbers for R, Python, or any other critical libraries or software packages. |
| Experiment Setup | Yes | The regularization parameter λ is set to 2 log p for linear regression and 0.5 log p for logistic regression. The signal strength c is 0.7 for the linear model and 1 for the logistic model. We use B = 10 replicates, and aggregate the p-values using formula (16) with γ_min = 0.1. The proportion of samples used for variable selection is varied in the set {0.5, 0.6, ..., 0.9}. We fix the significance level at 0.1. |
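The data layout described under "Dataset Splits" (K local selection subsets plus one central inference subset) can be sketched as follows. The function name, signature, and use of a random permutation are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def split_for_carving(n, K, n_inference, seed=None):
    """Partition indices {0, ..., n-1} into one inference subset D(0)
    (held at the central machine) and K disjoint selection subsets
    D(1), ..., D(K) (one per local machine). Hypothetical sketch of
    the split described in the paper, not its actual implementation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    d0 = idx[:n_inference]                           # reserved for selective inference
    d_local = np.array_split(idx[n_inference:], K)   # K subsets for variable selection
    return d0, d_local
```

With n = 7000, K = 3, and n_inference = 2000, this yields one inference set of 2000 samples and three selection sets of sizes comparable to the 1633–1788 reported in the paper (here equal-sized, since the paper does not specify its exact allocation).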
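The experiment setup aggregates p-values from B = 10 replicates "using formula (16) with γ_min = 0.1". Formula (16) is not reproduced in this report; the sketch below assumes it is the standard quantile-based aggregation rule of Meinshausen, Meier, and Bühlmann (2009), which is the common choice for this setting. Treat it as an assumption, not the paper's exact rule.

```python
import numpy as np

def aggregate_pvalues(pvals, gamma_min=0.1, n_grid=50):
    """Quantile aggregation of p-values across replicates
    (Meinshausen, Meier & Buhlmann, 2009) -- an assumed stand-in
    for the paper's formula (16).

    For each gamma in [gamma_min, 1], compute the empirical
    gamma-quantile of the p-values divided by gamma, then apply the
    (1 - log gamma_min) correction for searching over gamma."""
    pvals = np.asarray(pvals, dtype=float)
    gammas = np.linspace(gamma_min, 1.0, n_grid)
    q = np.array([np.quantile(pvals, g) / g for g in gammas])
    return min(1.0, (1.0 - np.log(gamma_min)) * q.min())
```

The correction factor (1 - log γ_min) ≈ 3.30 at γ_min = 0.1 guarantees the aggregated value remains a valid p-value despite optimizing over γ.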