Optimal Survey Design for Private Mean Estimation

Authors: Yu-Wei Chen, Raghu Pasupathy, Jordan Awan

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We numerically illustrate our method through simulation studies. Section 5.1 compares variances between naive and DP-aware stratified sampling. Section 5.2 explores the interplay between the non-private and purely DP designs. Section 5.4 showcases the computational efficiency of our algorithm. The input of Algorithm 1, x*, is obtained by package nloptr and alabama in R. All computations, including runtime measurements, were conducted on the Purdue Bell clusters using multiple cores. The source codes are available at https://github.com/garyUAchen/DP_Optim_Survey.
Researcher Affiliation Academia Department of Statistics, Purdue University, West Lafayette, IN, USA. Correspondence to: Jordan Awan <EMAIL>.
Pseudocode Yes Algorithm 1 Integer-Optimal Design
  Input: $x^*$ (the optimal continuous solution) and the Hessian matrix of $g$: $H_g(x^*)$
  for $i = 1, \dots, k-1$ do
    Define $T_i = \{ n_i \in \mathbb{N} : \lfloor x_i^* \rfloor \le n_i \le \lceil x_i^* \rceil \}$
  end for
  Define $T = \{ (n_1, \dots, n_{k-1}, n_k) : n_k = \eta - \sum_{i=1}^{k-1} n_i, \text{ where } (n_1, \dots, n_{k-1}) \in T_1 \times \dots \times T_{k-1} \}$
  Select $n_{\text{init}} = \arg\min_{n \in T} g(n)$
  Calculate the smallest eigenvalue $\lambda$ of $H_g(x^*)$
  Calculate radius $r = \sqrt{2 \, (g(n_{\text{init}}) - g(x^*)) / \lambda}$
  for $i = 1, \dots, k-1$ do
    Define $S_i = \{ n_i \in \mathbb{N} : \max(x_i^* - r, 1) \le n_i \le \min(x_i^* + r, N_i, \eta - k + 1) \}$
  end for
  Define $S = \{ (n_1, \dots, n_{k-1}, n_k) : n_k = \eta - \sum_{i=1}^{k-1} n_i, \text{ where } (n_1, \dots, n_{k-1}) \in S_1 \times \dots \times S_{k-1} \}$
  Select $n^* = \arg\min_{n \in S} g(n)$ by an exhaustive search.
  Output: $n^*$
Open Source Code Yes The source codes are available at https://github.com/garyUAchen/DP_Optim_Survey.
Open Datasets No The paper describes simulation scenarios with synthetic parameters for population sizes and variances, such as: "In this simulation, there are 4 groups with population sizes N = (7000, 8000, 9000, 10000) and variance σ² = (0.08, 0.082, 0.083, 0.084) and a total sample size η = 200." There is no mention of external public datasets or access information for any dataset.
Dataset Splits No The paper describes simulation setups using synthetic parameters, not a pre-existing dataset that would require splitting into training, validation, or test sets. Therefore, no dataset split information is provided.
Hardware Specification No All computations, including runtime measurements, were conducted on the Purdue Bell clusters using multiple cores. While a specific cluster name is mentioned, details such as the CPU model, exact number of cores, or memory specifications are not provided, which are necessary for a specific hardware description.
Software Dependencies No The input of Algorithm 1, x*, is obtained by package nloptr and alabama in R. This indicates the use of R and specific packages (nloptr and alabama), but no version numbers for R or the packages are provided.
Experiment Setup Yes In this simulation, there are 4 groups with population sizes N = (7000, 8000, 9000, 10000) and variance σ² = (0.08, 0.082, 0.083, 0.084) and a total sample size η = 200. We plot the variance ratio from a naive subsampling scheme to that of the integer-optimal design while varying ϵ from 0.01 to 100.
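
The Pseudocode row gives Algorithm 1 only in outline, so a small Python sketch of its steps may help readers follow it. This is not the authors' implementation (their repository is in R, using nloptr and alabama). Several things here are assumptions made for illustration: the function name integer_optimal_design, the feasibility filter that keeps every n_i between 1 and N_i, the fallback that keeps n_init among the final candidates, and the choice to take the Hessian with respect to the first k-1 coordinates after substituting n_k = η - Σ n_i. The objective g in the usage part is the textbook non-private stratified-sampling variance, used only as a stand-in because this summary does not state the paper's DP-aware objective. The population sizes, variances, and η come from the Experiment Setup row.

```python
import itertools
import math

import numpy as np


def integer_optimal_design(g, x_star, hessian, eta, N):
    """Illustrative sketch of Algorithm 1 (names and safeguards are assumptions).

    g       : objective taking a full allocation (n_1, ..., n_k)
    x_star  : continuous optimum, length k
    hessian : Hessian of g at x_star in the first k-1 coordinates (assumed)
    eta     : total sample size
    N       : population size of each group
    """
    x_star = np.asarray(x_star, dtype=float)
    k = len(x_star)

    def complete(head):
        # n_k is fixed by the budget constraint sum(n) = eta.
        return tuple(head) + (eta - sum(head),)

    def feasible(n):
        # Assumption: discard allocations with an empty or oversized group.
        return all(1 <= n_i <= N_i for n_i, N_i in zip(n, N))

    # Candidate set T: round each of the first k-1 coordinates down or up.
    T_axes = [sorted({math.floor(x), math.ceil(x)}) for x in x_star[:-1]]
    T = [n for n in map(complete, itertools.product(*T_axes)) if feasible(n)]
    n_init = min(T, key=g)

    # Search radius from the smallest Hessian eigenvalue.
    lam = float(np.linalg.eigvalsh(hessian).min())
    r = math.sqrt(max(2.0 * (g(n_init) - g(x_star)), 0.0) / lam)

    # Candidate set S: integers within r of x_star, clipped to the bounds.
    S_axes = []
    for x, N_i in zip(x_star[:-1], N[:-1]):
        lo = max(math.ceil(x - r), 1)
        hi = min(math.floor(x + r), N_i, eta - k + 1)
        S_axes.append(range(lo, hi + 1))
    S = [n for n in map(complete, itertools.product(*S_axes)) if feasible(n)]

    # Exhaustive search; n_init is kept as a safeguard (an addition here).
    return min(S + [n_init], key=g)


# Usage with the simulation parameters quoted in the Experiment Setup row.
N = np.array([7000, 8000, 9000, 10000])
sigma2 = np.array([0.08, 0.082, 0.083, 0.084])
eta = 200
a = (N / N.sum()) ** 2 * sigma2  # per-group weights W_i^2 * sigma_i^2


def g(n):
    # Stand-in objective: non-private stratified variance with finite
    # population correction. NOT the paper's DP-aware objective.
    n = np.asarray(n, dtype=float)
    return float(np.sum(a * (1.0 / n - 1.0 / N)))


# Continuous optimum of the stand-in objective (Neyman allocation).
weights = N * np.sqrt(sigma2)
x_star = eta * weights / weights.sum()

# Hessian in the first k-1 coordinates: diag(2 a_i / x_i^3) + 2 a_k / x_k^3.
H = np.diag(2 * a[:-1] / x_star[:-1] ** 3) + 2 * a[-1] / x_star[-1] ** 3

n_opt = integer_optimal_design(g, x_star, H, eta, N)
print(n_opt, sum(n_opt))
```

Passing g as a callable keeps the search logic separate from the objective, so a different objective and its Hessian at the continuous optimum could be swapped in without changing the function.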