Personalized Privacy Amplification via Importance Sampling
Authors: Dominik Fay, Sebastian Mair, Jens Sjölund
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate both approaches empirically in terms of privacy, efficiency, and accuracy on the differentially private k-means problem. We observe that both approaches yield similar outcomes and consistently outperform uniform sampling across a wide range of data sets. Our code is available on GitHub. |
| Researcher Affiliation | Collaboration | Dominik Fay EMAIL Elekta and KTH Royal Institute of Technology, Stockholm, Sweden Sebastian Mair EMAIL Linköping University and Uppsala University, Sweden Jens Sjölund EMAIL Uppsala University, Sweden |
| Pseudocode | Yes | Algorithm 1 Optimization for privacy-constrained importance weights |
| Open Source Code | Yes | Our code is available on GitHub. |
| Open Datasets | Yes | We use the following eight real-world data sets: Covertype (Blackard & Dean, 1999) (n = 581,012, d = 54), FMA (Defferrard et al., 2017) (n = 106,574, d = 518), Ijcnn1 (Chang & Lin, 2001) (n = 49,990, d = 22), KDD-Protein (n = 145,751, d = 74), MiniBooNE (Dua & Graff, 2017) (n = 130,064, d = 50), Pose (Catalin Ionescu, 2011; Ionescu et al., 2014) (n = 35,832, d = 48), RNA (Uzilov et al., 2006) (n = 488,565, d = 8), and Song (Bertin-Mahieux et al., 2011) (n = 515,345, d = 90). |
| Dataset Splits | No | The paper does not provide explicit training/test/validation dataset splits, percentages, or specific file names for custom splits. It mentions preprocessing steps like setting an ℓ2 cut-off point and centering the data, but no partition details for reproducibility of data splits. |
| Hardware Specification | Yes | All experiments run on a dual AMD Epyc machine with 2 × 64 cores at 2.25 GHz and 2 TiB of memory. |
| Software Dependencies | No | The code is implemented in Python using numpy (Harris et al., 2020). Algorithm 1 is implemented in Rust. The paper mentions software but does not specify version numbers for Python, Rust, or numpy, which are crucial for reproducibility. |
| Experiment Setup | Yes | We fix the number of iterations of DP-Lloyd to T = 10 and the number of clusters to k = 25. The scale parameters are set to βsum = √(d/(2ρ)) and βcount = ∛(4dρ²)·βsum, where ρ = 0.225 as suggested by Su et al. (2016). Here, B is a constant controlling the noise scales that we select to achieve a specific target epsilon ϵ ∈ {0.5, 1, 3, 10, 30, 100, 300, 1000} for a given (expected) subsample size m, and vice versa. ... For the coreset-based (core) sampling, the sampling distribution is qcore(x) = λ·m/n + (1 − λ)·m‖x‖₂² / Σ_{x′} ‖x′‖₂², where we set λ = 1/2. ... For the privacy-constrained (opt) sampling, we compute qpriv(x) numerically via Algorithm 1 for the target ϵ = (r/βsum + 1/βcount)·T, using the strong convexity constant from Equation (4). ... Furthermore, we initialize the cluster centers as k random data points. |
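The coreset-based sampling distribution quoted in the setup row can be sketched in a few lines of numpy. This is an illustrative reconstruction, not the authors' released code: the function names (`coreset_sampling_probs`, `poisson_subsample`) are hypothetical, and it assumes the data has already been centered (so squared norms play the role of squared distances to the mean) and that inclusion probabilities are clipped to [0, 1] for Poisson subsampling with importance weights 1/q.

```python
import numpy as np

def coreset_sampling_probs(X, m, lam=0.5):
    """Inclusion probabilities q_core(x) = lam*m/n + (1-lam)*m*||x||^2 / sum ||x'||^2.

    Assumes X is centered; with lam = 1/2 this mixes a uniform term and a
    norm-proportional term, each contributing half of the expected sample size m.
    """
    n = X.shape[0]
    sq_norms = np.sum(X ** 2, axis=1)
    q = lam * m / n + (1 - lam) * m * sq_norms / sq_norms.sum()
    return np.clip(q, 0.0, 1.0)  # valid probabilities for Poisson sampling

def poisson_subsample(X, q, rng):
    """Include each point independently with probability q_i.

    Returns the subsample and importance weights 1/q_i that debias
    downstream sums (e.g., the cluster-sum statistics in DP-Lloyd).
    """
    mask = rng.random(X.shape[0]) < q
    return X[mask], 1.0 / q[mask]

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))
X -= X.mean(axis=0)                       # center, as in the paper's preprocessing
q = coreset_sampling_probs(X, m=100)      # expected subsample size 100
subsample, weights = poisson_subsample(X, q, rng)
```

Note that the expected subsample size is Σᵢ qᵢ = m (absent clipping), since the uniform and norm-proportional terms sum to λm and (1 − λ)m respectively.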