Differentially private methods for managing model uncertainty in linear regression
Authors: Víctor Peña, Andrés F. Barrientos
JMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We illustrate the performance of our methods in Sections 4.4 and 5.2. We include additional results from the simulation studies in the Appendix. [...] We evaluate the performance of the methods described in this section in a simulation study and an application. |
| Researcher Affiliation | Academia | Víctor Peña EMAIL Departament d'Estadística i Investigació Operativa, Universitat Politècnica de Catalunya, Barcelona, Spain; Andrés F. Barrientos EMAIL Department of Statistics, Florida State University, Tallahassee, FL 32306, USA |
| Pseudocode | No | The paper references existing algorithms like "Algorithm 1 in Balle and Wang (2018)" and "Algorithm 2 in Sheffet (2019)", but it does not include any structured pseudocode or algorithm blocks for its own methodology. |
| Open Source Code | No | The paper states: "They are also conveniently implemented in the R package library(BAS) (Clyde, 2020)." and "We implement the methods with the R package BAS (Clyde, 2020)." This indicates the authors used an existing R package for their implementation but does not provide explicit access to their own specific source code for the methodology described in this paper. |
| Open Datasets | Yes | We analyze a random sample of 200 students from the High School and Beyond survey, which was conducted by the National Center of Education Statistics. We obtained the data from Diez et al. (2012). In R, they are available as data(hsb2) in library(openintro). [...] The data set includes n = 49,436 heads of households with non-negative incomes. We consider 6 predictors: age in years (β1), age squared (β2), marital status (β3), sex (β4), education (β5), and race (β6). All predictors are numeric or binary except for education, which is an ordinal variable. To reduce the number of coefficients in the model, we treat education as numeric, ranging from 1 (for less than 1st grade) to 16 (for doctoral degree). The binary predictors are: marital status (1: civilian spouse present; 0: otherwise), sex (1: male; 0: female), and race (1: white; 0: otherwise). The response variable is income. In this application, the non-private inclusion probabilities are all close to one. To provide a more challenging benchmark for our methods, we permute the rows for marital status and education in the design matrix to artificially make the inclusion probabilities for β3 and β5 close to zero. The predictors and the response are centered and rescaled to the interval (−0.5, 0.5). Figure 5 displays the posterior expected values of β1, β3, and β4 with the Zellner-Siow prior and ε = 0.9. We use the histograms described in Section 5.1 to define approximate 95% confidence sets for $T(G) = E(\beta_j \mid G)$. Our choice of matrix norm is the Frobenius norm. Specifically, we run our procedure 250 times and, for each run and a fixed collection of bins $B_1, \ldots, B_K$, we summarize each $T(\hat{C}_{1-\alpha})$ with its corresponding histogram $\mathrm{Hist}(T, \hat{C}_{0.95}) = \{(B_k, d_k)\}_{k=1}^{K}$. |
| Dataset Splits | No | The paper mentions "splitting the data into M disjoint subgroups" for the differential privacy mechanism and "simulating random data splits" or "simulate 1,000 data sets" for simulation studies. However, it does not specify any standard training/test/validation splits for the real-world datasets used in applications. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments or simulations. |
| Software Dependencies | Yes | Our work is based on mixtures of g-priors because, when combined with right-Haar priors on the common parameters, they satisfy a list of appealing criteria proposed in Bayarri et al. (2012). They are also conveniently implemented in the R package library(BAS) (Clyde, 2020). |
| Experiment Setup | Yes | The subsample and aggregate technique requires the specification of censoring limits L < U and a number of subgroups M. These choices affect the performance of the methods [...] We consider ε ∈ {0.5, 0.9} and, in the case of the Wishart mechanism, we set δ = 1/n. [...] In the simulation study, we found Bayes factors with the Zellner-Siow prior (ZS) and information criteria with BIC. The prior distribution on the model space π(γ) is the hierarchical uniform prior proposed in Scott and Berger (2010). [...] In all cases, we add a regularization parameter r to the diagonal entries of G . For the Laplace mechanism, we set r to be the 99-th percentile of eigmin(E), which we find via simulation. For the Wishart mechanism, we use the analytical expression in Remark 2 of Sheffet (2019). [...] We simulate data from a normal linear model with p predictors, where p is set to 2, 6, or 9. The sample size n (in thousands) varies from 5 to 10,000. The number of active predictors in the true model $\lvert T \rvert$ depends on the value of p and ranges from 0 (null model is true) to p (full model is true). Specifically, if p = 2, we set $\lvert T \rvert \in \{0, 1, 2\}$; if p = 6, we set $\lvert T \rvert \in \{0, 3, 6\}$; and if p = 9, we set $\lvert T \rvert \in \{0, 4, 9\}$. The predictors are independently drawn from the uniform distribution on (-2, 2). Following Hastie et al. (2017), we define the signal-to-noise ratio (SNR) as the variance of the regression mean (which is random, since we are simulating predictors and β) divided by σ². In our simulations, we assume that the intercept is zero and β is a p-dimensional vector equal to $b[1, \ldots, 1]^\top$. We use optimization to find σ² and b such that SNR = 0.5 and the response falls within (-2, 2) with high probability. For each combination of $\lvert T \rvert$ and n, we simulate 1,000 data sets. All the data sets we simulated are such that the response falls in (-2, 2). We assess the performance of the methods by tracking Monte Carlo averages of predictive mean squared errors and the posterior probability of the true model. [...] The predictors and the response are centered and rescaled to the interval (-0.5, 0.5). We use the histograms described in Section 5.1 to define approximate 95% confidence sets for $T(G) = E(\beta_j \mid G)$. |
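
The simulation design quoted in the Experiment Setup row can be illustrated with a minimal Python sketch. It is not the authors' code (the paper works in R with the BAS package), and several details are assumptions made only for illustration: the function names `simulate_dataset` and `rescale`, the choice to make the first `n_active` coefficients equal to `b` and the rest zero, the fallback noise variance `sigma2_null` for the null model, and the use of fixed bounds for rescaling. The paper finds σ² and b by optimization; here b is passed in directly and σ² is derived from the target SNR.

```python
import math
import random

# Variance of a Uniform(-2, 2) predictor: (upper - lower)^2 / 12 = 4/3.
UNIFORM_VAR = (2.0 - (-2.0)) ** 2 / 12.0


def simulate_dataset(n, p, n_active, b, snr=0.5, sigma2_null=1.0, seed=None):
    """Draw one data set from a zero-intercept normal linear model.

    Assumptions (not taken from the paper): the first `n_active` coefficients
    equal `b`, the remaining ones are zero, and `sigma2_null` is the noise
    variance used when no predictor is active (the SNR is undefined there).
    """
    rng = random.Random(seed)
    beta = [b if j < n_active else 0.0 for j in range(p)]
    # Variance of the regression mean x'beta for independent Uniform(-2, 2) predictors.
    signal_var = sum(bj ** 2 for bj in beta) * UNIFORM_VAR
    # SNR = signal_var / sigma2, so sigma2 = signal_var / SNR.
    sigma2 = signal_var / snr if signal_var > 0 else sigma2_null
    sigma = math.sqrt(sigma2)
    X = [[rng.uniform(-2.0, 2.0) for _ in range(p)] for _ in range(n)]
    y = [sum(x * bj for x, bj in zip(row, beta)) + rng.gauss(0.0, sigma) for row in X]
    return X, y, beta, sigma2


def rescale(values, lower, upper):
    """Map values from (lower, upper) to (-0.5, 0.5) using fixed bounds.

    Assumption: the bounds are known in advance rather than computed from the data.
    """
    mid = (lower + upper) / 2.0
    width = upper - lower
    return [(v - mid) / width for v in values]


if __name__ == "__main__":
    X, y, beta, sigma2 = simulate_dataset(n=5000, p=6, n_active=3, b=0.1, seed=1)
    inside = all(-2.0 < v < 2.0 for v in y)
    print("sigma2:", sigma2, "| response inside (-2, 2):", inside)
    y_scaled = rescale(y, -2.0, 2.0)
```

A data set whose response leaves (-2, 2) could be discarded and redrawn, which matches the statement that all simulated data sets have responses in that interval; how the authors enforce this is not specified in the quoted text.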