reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Differentially Private Hypothesis Testing for Linear Regression

Authors: Daniel G. Alabi, Salil P. Vadhan

JMLR 2023

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Through a suite of Monte Carlo based experiments, we show that our tests achieve desired significance levels and have a high power that approaches the power of the non-private tests as we increase sample sizes or the privacy-loss parameter. We also show when our tests outperform existing methods in the literature. ... Experimental evaluation of our hypothesis tests is done on: 1. Synthetic Data: We generate synthetic datasets with different distributions on the independent (or explanatory) variables. ... 2. Opportunity Insights (OI) Data: We use a simulated version of the data ... 3. Bike Sharing Dataset: We use a real-world dataset publicly available in the UCI machine learning repository.
Researcher Affiliation	Academia	Daniel G. Alabi EMAIL Data Science Institute Columbia University New York, NY 10027, USA Salil P. Vadhan EMAIL John A. Paulson School of Engineering & Applied Sciences Harvard University Allston, MA 02134, USA
Pseudocode	Yes	Algorithm 1: Monte Carlo DP Test Framework. Algorithm 2: ρ-zCDP procedure DPStats L. Algorithm 3: ρ-zCDP procedure DPBern. Algorithm 4: ρ-zCDP procedure DPStats M. Algorithm 5: ρ-zCDP procedure DPKW. Algorithm 6: DP Test Framework via Parametric Bootstrap Confidence Intervals.
Open Source Code	No	The paper does not contain an explicit statement about the release of their source code or a link to a code repository.
Open Datasets	Yes	2. Opportunity Insights (OI) Data: We use a simulated version of the data used by the Opportunity Insights team (an economics research lab) to release the Opportunity Atlas tool, primarily used to predict social and economic mobility. 3. Bike Sharing Dataset: We use a real-world dataset publicly available in the UCI machine learning repository. The dataset consists of daily and hourly counts (with other information such as seasonal and weather information) of bike rentals in the Capital bikeshare system in years 2011 and 2012. ... We use the UCI bike dataset (Fanaee-T and Gama, 2014) with 17,389 instances.
Dataset Splits	No	The paper mentions generating synthetic datasets of varying sizes and random selection of tracts for the OI data, but does not provide specific training/test/validation splits (e.g., percentages, sample counts, or predefined split references).
Hardware Specification	No	The paper does not contain any specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running the experiments.
Software Dependencies	No	The paper does not provide specific software dependencies, such as library names with version numbers, used to replicate the experiment.
Experiment Setup	Yes	For experimental evaluation on synthetic datasets, we generated datasets with sizes between n = 100 and n = 10,000. For both the linear relationship and mixture model tests on synthetic data below, we consider a subset of the following values of the privacy budget ρ: {0.1²/2, 0.5²/2, 1²/2, 2²/2, 3²/2, 5²/2, 10²/2}. We draw the independent variables x_1, ..., x_n according to a few different distributions: Normal, Uniform, Exponential. ... For all tests below, the clipping parameter is set to either 2 or 3. For estimating the power and significance, we fix the target significance level to 0.05 and run Monte Carlo tests 2000 times.
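
The setup in the last row can be illustrated with a minimal Python sketch. It is not the authors' code, which the paper does not release. It assumes the listed privacy budgets follow the pattern ρ = s²/2 for s in {0.1, 0.5, 1, 2, 3, 5, 10}, and that significance is estimated as the fraction of 2000 Monte Carlo trials in which a test rejects at level 0.05. The names `privacy_budgets`, `estimate_rejection_rate`, and `uniform_p_value` are hypothetical, and the placeholder test simply draws a Uniform(0, 1) p-value, which is how an exactly calibrated test behaves under the null hypothesis.

```python
import random

# Assumption: the listed budgets have the form rho = s^2 / 2 for these values of s.
SCALES = [0.1, 0.5, 1, 2, 3, 5, 10]


def privacy_budgets(scales=SCALES):
    """Return the privacy budgets rho = s^2 / 2, one per scale value."""
    return [s ** 2 / 2 for s in scales]


def estimate_rejection_rate(run_test, n_trials=2000, alpha=0.05, seed=0):
    """Estimate how often a test rejects at level alpha.

    run_test is a hypothetical callable that takes a random.Random instance
    and returns a p-value for one freshly simulated dataset. With data drawn
    under the null this estimates significance; under the alternative, power.
    """
    rng = random.Random(seed)
    rejections = sum(run_test(rng) <= alpha for _ in range(n_trials))
    return rejections / n_trials


def uniform_p_value(rng):
    """Placeholder test: a calibrated test has Uniform(0, 1) p-values under the null."""
    return rng.random()


if __name__ == "__main__":
    print("rho values:", privacy_budgets())
    print("estimated significance:", estimate_rejection_rate(uniform_p_value))
```

Replacing `uniform_p_value` with a function that simulates a dataset and runs an actual private test would turn the same loop into a significance or power estimate for that test.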