Differentially Private Hypothesis Testing for Linear Regression
Authors: Daniel G. Alabi, Salil P. Vadhan
JMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through a suite of Monte Carlo based experiments, we show that our tests achieve desired significance levels and have a high power that approaches the power of the non-private tests as we increase sample sizes or the privacy-loss parameter. We also show when our tests outperform existing methods in the literature. ... Experimental evaluation of our hypothesis tests is done on: 1. Synthetic Data: We generate synthetic datasets with different distributions on the independent (or explanatory) variables. ... 2. Opportunity Insights (OI) Data: We use a simulated version of the data ... 3. Bike Sharing Dataset: We use a real-world dataset publicly available in the UCI machine learning repository. |
| Researcher Affiliation | Academia | Daniel G. Alabi EMAIL Data Science Institute Columbia University New York, NY 10027, USA Salil P. Vadhan EMAIL John A. Paulson School of Engineering & Applied Sciences Harvard University Allston, MA 02134, USA |
| Pseudocode | Yes | Algorithm 1: Monte Carlo DP Test Framework. Algorithm 2: ρ-zCDP procedure DPStats L. Algorithm 3: ρ-zCDP procedure DPBern. Algorithm 4: ρ-zCDP procedure DPStats M. Algorithm 5: ρ-zCDP procedure DPKW. Algorithm 6: DP Test Framework via Parametric Bootstrap Confidence Intervals. |
| Open Source Code | No | The paper does not contain an explicit statement about the release of their source code or a link to a code repository. |
| Open Datasets | Yes | 2. Opportunity Insights (OI) Data: We use a simulated version of the data used by the Opportunity Insights team (an economics research lab) to release the Opportunity Atlas tool, primarily used to predict social and economic mobility. 3. Bike Sharing Dataset: We use a real-world dataset publicly available in the UCI machine learning repository. The dataset consists of daily and hourly counts (with other information such as seasonal and weather information) of bike rentals in the Capital bikeshare system in years 2011 and 2012. ... We use the UCI bike dataset (Fanaee-T and Gama, 2014) with 17,389 instances. |
| Dataset Splits | No | The paper mentions generating synthetic datasets of varying sizes and random selection of tracts for the OI data, but does not provide specific training/test/validation splits (e.g., percentages, sample counts, or predefined split references). |
| Hardware Specification | No | The paper does not contain any specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies, such as library names with version numbers, used to replicate the experiment. |
| Experiment Setup | Yes | For experimental evaluation on synthetic datasets, we generated datasets with sizes between n = 100 and n = 10,000. For both the linear relationship and mixture model tests on synthetic data below, we consider a subset of the following values of the privacy budget ρ: {0.1²/2, 0.5²/2, 1²/2, 2²/2, 3²/2, 5²/2, 10²/2}. We draw the independent variables x1, ..., xn according to a few different distributions: Normal, Uniform, Exponential. ... For all tests below, the clipping parameter is set to either 2 or 3. For estimating the power and significance, we fix the target significance level to 0.05 and run Monte Carlo tests 2000 times. |
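
The experiment setup above can be illustrated with a minimal Python sketch. It is not the authors' code (no release is reported). It assumes the listed privacy budgets have the form ρ = ε²/2 for ε in {0.1, 0.5, 1, 2, 3, 5, 10}, uses the standard fact that Gaussian noise with standard deviation Δ/√(2ρ) on a sensitivity-Δ statistic satisfies ρ-zCDP, and estimates a rejection rate over 2000 Monte Carlo trials at level 0.05. The names `EPS_VALUES`, `RHO_GRID`, `gaussian_sigma`, and `rejection_rate` are hypothetical, and the paper's own procedures may calibrate noise differently.

```python
import math
import random

# Assumed reading of the setup: each privacy budget is rho = eps^2 / 2.
EPS_VALUES = [0.1, 0.5, 1, 2, 3, 5, 10]
RHO_GRID = [eps ** 2 / 2 for eps in EPS_VALUES]


def gaussian_sigma(sensitivity, rho):
    """Noise standard deviation for which the Gaussian mechanism is rho-zCDP."""
    return sensitivity / math.sqrt(2 * rho)


def rejection_rate(test, make_dataset, trials=2000, alpha=0.05, seed=0):
    """Fraction of Monte Carlo trials in which `test` rejects at level alpha.

    `make_dataset(rng)` draws one dataset; `test(dataset)` returns a p-value.
    Under the null this estimates significance; under an alternative, power.
    """
    rng = random.Random(seed)
    rejections = sum(test(make_dataset(rng)) <= alpha for _ in range(trials))
    return rejections / trials
```

To use it, pass a dataset generator and any test that returns a p-value. A test whose p-value is uniform on [0, 1] under the null should give a rejection rate close to the target level of 0.05.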