Best Linear Unbiased Estimate from Privatized Contingency Tables
Authors: Jordan Awan, Adam Edwards, Paul Bartholomew, Andrew Sillers
JMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We illustrate our method in simulation studies, comparing mean-squared error (MSE), confidence interval coverage/width, and computational time/memory. We also apply our methodology to two 2010 Census demonstration products, Redistricting Data (Public Law 94-171, which we abbreviate as PL94) and Demographic and Housing Characteristics (DHC), illustrating the scalability and validity of our methods. |
| Researcher Affiliation | Collaboration | Jordan Awan, Department of Statistics, Purdue University, West Lafayette, IN 47907, USA; Adam Edwards, The MITRE Corporation, McLean, VA 22102, USA |
| Pseudocode | Yes | Algorithm 1 Pseudo Code for Collection Step; Algorithm 2 Pseudo Code for Down Pass |
| Open Source Code | Yes | The code for these experiments is available at https://github.com/JordanAwan/SeaBlue. |
| Open Datasets | Yes | We use the 2010 demonstration products for PL94 and DHC, rather than the official 2020 releases. ... We use these demonstration products in our simulations because we are able to use the Census 2010 Summary File 1 (SF1) counts as the approximately true values for the same geographies. SF1 is the official 2010 tabulation of Census counts and these SF1 values are used to assess mean squared error (MSE) and interval coverage, whereas no analogous product was available for the 2020 release. |
| Dataset Splits | No | We used a sample of 40 Census blocks from Rhode Island for PL94, and for DHC, we used state-level counts from Rhode Island. For PL94, our replications were simulated by taking a sample of geographies, since each geography only has one set of counts. ... In our controlled settings, we generate data for a $k^k$ table for $k = 3, 4, 5, 6$, meaning that there are $k$ variables which each have $k$ levels, and we observe all possible noisy margins. |
| Hardware Specification | Yes | Tests were run on a Windows laptop with an Intel i7-7820HQ CPU (4 cores / 2.90 GHz) and 8 GB RAM. |
| Software Dependencies | No | Simulations were conducted in Python, where a custom script was written to implement SEA BLUE, and numpy functions were used to perform the matrix operations needed to implement the matrix projection. |
| Experiment Setup | Yes | In our controlled settings, we generate data for a $k^k$ table for $k = 3, 4, 5, 6$, meaning that there are $k$ variables which each have $k$ levels, and we observe all possible noisy margins. For a single geography we generate the detailed table using a zero-inflated Poisson distribution, and aggregate back to the total count. Noise is added using a normal distribution to each count type independently. ... The default variance for each count is set to 2, while in the listed tables the variance was set to be either 1, 2, or 3, uniformly at random. ... Simulation was conducted with normally distributed random variables, α = .05, and 100,000 replicates. |
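
The controlled setting quoted in the Experiment Setup row can be sketched in a few lines of numpy. This is only an illustration under stated assumptions, not the authors' script: the function names, the Poisson rate `lam`, and the zero-inflation probability `p_zero` are hypothetical, since the quoted text does not give them. Only the table shape ($k$ variables with $k$ levels each), the zero-inflated Poisson counts, the set of all margins, and the independent normal noise with default variance 2 come from the quoted description.

```python
import itertools
import numpy as np


def zero_inflated_poisson(rng, shape, lam, p_zero):
    """Draw Poisson counts, then force a random subset of cells to zero."""
    counts = rng.poisson(lam, size=shape)
    structural_zero = rng.random(shape) < p_zero
    return np.where(structural_zero, 0, counts)


def all_margins(table):
    """Return every margin of the table, keyed by the tuple of kept axes.

    The empty tuple () is the grand total; the full tuple is the detailed table.
    """
    k = table.ndim
    margins = {}
    for r in range(k + 1):
        for kept in itertools.combinations(range(k), r):
            summed = tuple(ax for ax in range(k) if ax not in kept)
            # Nothing to sum over when every axis is kept.
            margins[kept] = table.sum(axis=summed) if summed else table.copy()
    return margins


def add_normal_noise(rng, margins, variance=2.0):
    """Add independent N(0, variance) noise to every count in every margin."""
    sd = np.sqrt(variance)
    return {
        kept: m + rng.normal(0.0, sd, size=np.shape(m))
        for kept, m in margins.items()
    }


rng = np.random.default_rng(0)
k = 3  # the quoted setup uses k = 3, 4, 5, 6
# lam and p_zero are illustrative values, not taken from the paper.
table = zero_inflated_poisson(rng, (k,) * k, lam=5.0, p_zero=0.3)
margins = all_margins(table)
noisy = add_normal_noise(rng, margins, variance=2.0)
```

With $k$ variables there are $2^k$ margins, from the grand total up to the full $k^k$ detailed table. The noisy margins are no longer mutually consistent, which is the situation a best linear unbiased estimate is meant to resolve.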