Personalized Privacy Amplification via Importance Sampling

Authors: Dominik Fay, Sebastian Mair, Jens Sjölund

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate both approaches empirically in terms of privacy, efficiency, and accuracy on the differentially private k-means problem. We observe that both approaches yield similar outcomes and consistently outperform uniform sampling across a wide range of data sets. Our code is available on GitHub.
Researcher Affiliation | Collaboration | Dominik Fay (EMAIL), Elekta and KTH Royal Institute of Technology, Stockholm, Sweden; Sebastian Mair (EMAIL), Linköping University and Uppsala University, Sweden; Jens Sjölund (EMAIL), Uppsala University, Sweden
Pseudocode | Yes | Algorithm 1: Optimization for privacy-constrained importance weights
Open Source Code | Yes | Our code is available on GitHub.
Open Datasets | Yes | We use the following eight real-world data sets: Covertype (Blackard & Dean, 1999) (n = 581,012, d = 54), FMA (Defferrard et al., 2017) (n = 106,574, d = 518), Ijcnn1 (Chang & Lin, 2001) (n = 49,990, d = 22), KDD-Protein (n = 145,751, d = 74), MiniBooNE (Dua & Graff, 2017) (n = 130,064, d = 50), Pose (Catalin Ionescu, 2011; Ionescu et al., 2014) (n = 35,832, d = 48), RNA (Uzilov et al., 2006) (n = 488,565, d = 8), and Song (Bertin-Mahieux et al., 2011) (n = 515,345, d = 90).
Dataset Splits | No | The paper does not provide explicit training/test/validation dataset splits, percentages, or specific file names for custom splits. It mentions preprocessing steps such as setting an ℓ2 cut-off point and centering the data, but gives no partition details needed to reproduce data splits.
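The ℓ2 cut-off and centering mentioned above can be sketched in numpy as follows. This is a minimal illustrative sketch, not the authors' implementation: the function name, the clipping radius r, and the order of centering before clipping are all assumptions.

```python
import numpy as np

def preprocess(X, r):
    """Center the data, then clip each point to l2 norm at most r.
    (Ordering and radius are illustrative assumptions.)"""
    X = X - X.mean(axis=0)                      # center columns at zero
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    scale = np.minimum(1.0, r / np.maximum(norms, 1e-12))
    return X * scale                            # rows with norm > r are shrunk onto the ball

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
Xp = preprocess(X, r=1.0)
```

After preprocessing, every row lies inside the ℓ2 ball of radius r, which is the kind of bounded-norm assumption differentially private mechanisms typically require.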
Hardware Specification | Yes | All experiments run on a dual AMD Epyc machine with 2 × 64 cores at 2.25 GHz and 2 TiB of memory.
Software Dependencies | No | The code is implemented in Python using numpy (Harris et al., 2020); Algorithm 1 is implemented in Rust. The paper names this software but does not specify version numbers for Python, Rust, or numpy, which are crucial for reproducibility.
Experiment Setup | Yes | We fix the number of iterations of DP-Lloyd to T = 10 and the number of clusters to k = 25. The scale parameters are set to β_sum = √(d/(2ρ)) and β_count = ∛(4dρ²) · β_sum, where ρ = 0.225 as suggested by Su et al. (2016). Here, B is a constant controlling the noise scales that we select to achieve a specific target epsilon ϵ ∈ {0.5, 1, 3, 10, 30, 100, 300, 1000} for a given (expected) subsample size m, and vice versa. ... For the coreset-based (core) sampling, the sampling distribution is q_core(x) = λ · m/n + (1 − λ) · m‖x‖₂² / Σ_{x′} ‖x′‖₂², where we set λ = 1/2. ... For the privacy-constrained (opt) sampling, we compute q_priv(x) numerically via Algorithm 1 for the target ϵ = (r/β_sum + 1/β_count) · T, using the strong convexity constant from Equation (4). ... Furthermore, we initialize the cluster centers as k random data points.
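The coreset-based sampling distribution quoted above mixes a uniform term with a norm-proportional term. A minimal numpy sketch, assuming Poisson subsampling with expected size m (the function names and the clipping of probabilities to [0, 1] are illustrative assumptions; the authors' actual implementation is in their GitHub repository):

```python
import numpy as np

def coreset_sampling_probs(X, m, lam=0.5):
    """Inclusion probabilities q_core(x) = lam*m/n
    + (1-lam)*m*||x||^2 / sum_x' ||x'||^2, clipped to [0, 1]."""
    n = X.shape[0]
    norms_sq = np.sum(X**2, axis=1)
    q = lam * m / n + (1 - lam) * m * norms_sq / norms_sq.sum()
    return np.clip(q, 0.0, 1.0)

def poisson_subsample(X, q, rng):
    """Include each point independently with its own probability q_i."""
    mask = rng.random(X.shape[0]) < q
    return X[mask], q[mask]

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
q = coreset_sampling_probs(X, m=100)          # probabilities sum to m (before clipping)
sub, q_sub = poisson_subsample(X, q, rng)     # subsample of expected size ~100
```

Because each term contributes its share of the budget (λm from the uniform part, (1−λ)m from the norm-proportional part), the expected subsample size is m whenever no probability is clipped.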