Privacy Amplification Through Synthetic Data: Insights from Linear Regression
Authors: Clément Pierquin, Aurélien Bellet, Marc Tommasi, Matthieu Boussard
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical studies, however, suggest that synthetic data may offer even stronger privacy protection than the theoretical guarantees provided by the model (Annamalai et al., 2024). This suggests that certain structural properties of the data or the generative process itself could contribute to an implicit privacy amplification effect. ... Our results are two-fold. First, in Section 3, we present negative results in scenarios where an adversary controls the seed of the synthetic data generation process. ... Second, in Section 4, we analyze the privacy guarantees when synthetic data is generated from random inputs to a private regression model obtained via output perturbation. We demonstrate that privacy amplification is possible in this setting, depending on the model size and the number of released synthetic samples. All proofs can be found in the supplementary material. ... Figure 1 represents a numerical computation of the upper bound. ... Figure 2 reports the logarithm of the estimated divergence as a function of log(d) for multiple values of α. ... Figure 3 suggests an approximate convergence rate of O(d^{-1/2}) when l and n are fixed. ... Methodology of experiments releasing one output. ... Methodology of experiments releasing multiple outputs. |
| Researcher Affiliation | Collaboration | 1Craft AI, Paris, France 2Université de Lille, Inria, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France 3Inria, Université de Montpellier, INSERM. Correspondence to: Clément Pierquin <EMAIL>. |
| Pseudocode | No | The paper describes algorithms like Noisy Gradient Descent (NGD) through mathematical update rules (e.g., coupled updates of the form V_{t+1} = V_t − η∇_w F(V_t, X, Y) + noise and W_{t+1} = W_t − η∇_w F(W_t, X, Y) + noise in Section 3.2), but it does not present these or any other procedures in a structured pseudocode or algorithm block. |
| Open Source Code | No | The paper does not provide any explicit statements about the release of source code, nor does it include links to a code repository in the main text or supplementary material. |
| Open Datasets | No | The paper focuses on theoretical analysis and numerical computation of privacy bounds in the context of linear regression. Its 'experiments' involve numerical estimations of theoretical constructs and parameter settings (e.g., 'Let v, w ∈ ℝ^{1×d}.', 'We set σ_θ = 1.'). It does not use or refer to any specific publicly available datasets for empirical evaluation. |
| Dataset Splits | No | The paper does not conduct experiments on empirical datasets; instead, it performs numerical computations based on theoretical models. Therefore, there are no mentions of training/test/validation dataset splits. |
| Hardware Specification | No | The paper describes theoretical analysis and numerical computations, but it does not specify any particular hardware (e.g., CPU, GPU models, or cloud resources) used to perform these computations. |
| Software Dependencies | No | In Appendix B.7, the paper mentions using `scipy.optimize.brentq`, `scipy.stats.norm.cdf`, `scipy.stats.norm.ppf`, `scipy.stats.chi2.cdf`, `scipy.stats.chi2.ppf`, and `np.finfo`. While these indicate the use of Python libraries like SciPy and NumPy, specific version numbers for these components are not provided. |
| Experiment Setup | Yes | In Appendix B.7 ('Methodology of experiments releasing one output' and 'Methodology of experiments releasing multiple outputs'), the paper provides details on the numerical setup: 'We set the number of samples L = 5×10⁵. We draw M = 50 times (X_{k,k'})_{1≤k≤L, 1≤k'≤M} i.i.d. from Unif([0, 1]). Then, we run the procedure and the estimates are averaged: el_α(h) = (1/(M(α−1))) Σ_{k'=1}^{M} |h'(X_{k,k'})|^{1−α}. We repeat the process for different values of d and α. We set σ_θ = 1.' |
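The NGD update quoted in the Pseudocode row can be made concrete. The sketch below is a minimal NumPy implementation of noisy gradient descent on a least-squares loss; the noise scale `sqrt(2 * eta) * sigma` and the toy data are illustrative assumptions, not the paper's exact calibration.

```python
import numpy as np

def noisy_gd(X, Y, eta=0.01, sigma=1.0, steps=100, seed=0):
    """Noisy gradient descent on F(w, X, Y) = 0.5 * ||X w - Y||^2.

    Each step follows w <- w - eta * grad F(w, X, Y) + Gaussian noise,
    mirroring the V_t / W_t update rules quoted from Section 3.2.
    The noise scale sqrt(2 * eta) * sigma is an assumption for
    illustration, not the paper's exact choice.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (X @ w - Y)  # gradient of the squared loss
        w = w - eta * grad + np.sqrt(2 * eta) * sigma * rng.standard_normal(d)
    return w

# Toy usage on synthetic linear-regression data (hypothetical parameters).
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(50)
w_priv = noisy_gd(X, Y, eta=0.005, sigma=0.1, steps=200)
```

With a small noise scale the iterate lands near the least-squares solution; larger `sigma` trades accuracy for stronger (differential-privacy style) noise.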
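The batched Monte Carlo procedure quoted in the Experiment Setup row can likewise be sketched. This is a generic implementation of the averaged estimator shape described there (L uniform draws per batch, M batches, terms of the form |h'(X)|^{1−α}); the function `h_prime` and the exact normalization are placeholders for the paper's construction, not its definition.

```python
import numpy as np

def renyi_moment_estimate(h_prime, alpha, L=500_000, M=50, seed=0):
    """Monte Carlo estimate averaged over M independent batches.

    Each batch draws L samples X ~ Unif([0, 1]) and averages
    |h'(X)|^(1 - alpha); the per-batch estimates are then averaged
    and scaled by 1 / (alpha - 1). Hypothetical stand-in for the
    estimator quoted from Appendix B.7.
    """
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(M):
        X = rng.uniform(0.0, 1.0, size=L)
        estimates.append(np.mean(np.abs(h_prime(X)) ** (1.0 - alpha)))
    return np.mean(estimates) / (alpha - 1.0)

# Toy usage with a hypothetical derivative h'(x) = 2 + x and alpha = 2,
# so each term is 1 / (2 + x); the exact mean is log(3/2) ≈ 0.405.
est = renyi_moment_estimate(lambda x: 2.0 + x, alpha=2.0, L=10_000, M=10)
```

Averaging over M independent batches, as the paper does, reduces the variance of the final estimate at fixed per-batch cost L.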