Best Linear Unbiased Estimate from Privatized Contingency Tables

Authors: Jordan Awan, Adam Edwards, Paul Bartholomew, Andrew Sillers

JMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We illustrate our method in simulation studies, comparing mean-squared error (MSE), confidence interval coverage/width, and computational time/memory. We also apply our methodology to two 2010 Census demonstration products, Redistricting Data (Public Law 94-171, which we abbreviate as PL94) and Demographic and Housing Characteristics (DHC), illustrating the scalability and validity of our methods.
Researcher Affiliation | Collaboration | Jordan Awan, EMAIL, Department of Statistics, Purdue University, West Lafayette, IN 47907, USA; Adam Edwards, EMAIL, The MITRE Corporation, McLean, VA 22102, USA
Pseudocode | Yes | Algorithm 1: Pseudo Code for Collection Step; Algorithm 2: Pseudo Code for Down Pass
Open Source Code | Yes | The code for these experiments is available at https://github.com/JordanAwan/SeaBlue.
Open Datasets | Yes | We use the 2010 demonstration products for PL94 and DHC, rather than the official 2020 releases. ... We use these demonstration products in our simulations because we are able to use the Census 2010 Summary File 1 (SF1) counts as the approximately true values for the same geographies. SF1 is the official 2010 tabulation of Census counts and these SF1 values are used to assess mean squared error (MSE) and interval coverage, whereas no analogous product was available for the 2020 release.
Dataset Splits | No | We used a sample of 40 Census blocks from Rhode Island for PL94, and for DHC, we used state-level counts from Rhode Island. For PL94, our replications were simulated by taking a sample of geographies, since each geography only has one set of counts. ... In our controlled settings, we generate data for a k^k table for k = 3, 4, 5, 6, meaning that there are k variables which each have k levels, and we observe all possible noisy margins.
Hardware Specification | Yes | Tests were run on a Windows laptop with an Intel i7-7820HQ CPU (4 cores / 2.90 GHz) and 8 GB RAM.
Software Dependencies | No | Simulations were conducted in Python, where a custom script was written to implement SEA BLUE, and numpy functions were used to perform the matrix operations needed to implement the matrix projection.
Experiment Setup | Yes | In our controlled settings, we generate data for a k^k table for k = 3, 4, 5, 6, meaning that there are k variables which each have k levels, and we observe all possible noisy margins. For a single geography we generate the detailed table using a zero-inflated Poisson distribution, and aggregate back to the total count. Noise is added using a normal distribution to each count type independently. ... The default variance for each count is set to 2, while in the listed tables the variance was set to be either 1, 2, or 3, uniformly at random. ... Simulation was conducted with normally distributed random variables, α = 0.05, and 100,000 replicates.
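
The controlled setting described under Experiment Setup can be sketched in a few lines of numpy. This is only an illustration under stated assumptions, not the authors' script: the Poisson rate and zero-inflation probability (`rate`, `zero_prob`) are hypothetical because the excerpt does not report them, "all possible noisy margins" is read here as one margin per subset of the k variables, and `dense_blue` is the textbook weighted least squares form of a best linear unbiased estimate, not the scalable SEA BLUE algorithm itself. All function names are made up for this sketch.

```python
import itertools
import numpy as np

def margin_matrix(k, keep):
    """0/1 matrix mapping the flattened k^k table to the margin over variables `keep`."""
    n_cells = k ** k
    # coords[a, j] = level of variable a in detailed cell j (C-order flattening)
    coords = np.indices((k,) * k).reshape(k, n_cells)
    if keep:
        rows = np.ravel_multi_index(tuple(coords[a] for a in keep), (k,) * len(keep))
    else:
        rows = np.zeros(n_cells, dtype=int)  # empty subset = grand total
    A = np.zeros((k ** len(keep), n_cells))
    A[rows, np.arange(n_cells)] = 1.0
    return A

def simulate(k=3, rate=5.0, zero_prob=0.5, variance=2.0, seed=0):
    """Zero-inflated Poisson table plus independent normal noise on every margin.

    `rate` and `zero_prob` are assumed values, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    n_cells = k ** k
    # Zero inflation: each cell is forced to zero with probability zero_prob.
    x = rng.poisson(rate, n_cells) * (rng.random(n_cells) >= zero_prob)
    # One block of rows per subset of variables, from the total to the full table.
    blocks = [margin_matrix(k, keep)
              for r in range(k + 1)
              for keep in itertools.combinations(range(k), r)]
    A = np.vstack(blocks)
    var = np.full(A.shape[0], variance)  # default variance of 2 per count
    y = A @ x + rng.normal(0.0, np.sqrt(var))
    return x, A, y, var

def dense_blue(A, y, var):
    """Weighted least squares estimate of the detailed cells (dense baseline)."""
    w = 1.0 / np.sqrt(var)
    est, *_ = np.linalg.lstsq(A * w[:, None], y * w, rcond=None)
    return est

x, A, y, var = simulate(k=3)
estimate = dense_blue(A, y, var)
print("MSE of raw noisy detailed cells:", np.mean((y[-x.size:] - x) ** 2))
print("MSE of combined estimate:       ", np.mean((estimate - x) ** 2))
```

The dense solve is only practical for small k, since the design matrix has (k + 1)^k rows and k^k columns; it is meant as a reference point for what a combined estimate looks like, not as a substitute for the method in the paper.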