Privacy Amplification Through Synthetic Data: Insights from Linear Regression
Authors: Clément Pierquin, Aurélien Bellet, Marc Tommasi, Matthieu Boussard
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical studies, however, suggest that synthetic data may offer even stronger privacy protection than the theoretical guarantees provided by the model (Annamalai et al., 2024). This suggests that certain structural properties of the data or the generative process itself could contribute to an implicit privacy amplification effect. ... Our results are two-fold. First, in Section 3, we present negative results in scenarios where an adversary controls the seed of the synthetic data generation process. ... Second, in Section 4, we analyze the privacy guarantees when synthetic data is generated from random inputs to a private regression model obtained via output perturbation. We demonstrate that privacy amplification is possible in this setting, depending on the model size and the number of released synthetic samples. All proofs can be found in the supplementary material. ... Figure 1 represents a numerical computation of the upper bound. ... Figure 2 reports the logarithm of the estimated divergence as a function of log(d) for multiple values of α. ... Figure 3 suggests an approximate convergence rate of O(d^{-1/2}) when l and n are fixed. ... Methodology of experiments releasing one output. ... Methodology of experiments releasing multiple outputs. |
| Researcher Affiliation | Collaboration | 1Craft AI, Paris, France 2Université de Lille, Inria, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France 3Inria, Université de Montpellier, INSERM. Correspondence to: Clément Pierquin <EMAIL>. |
| Pseudocode | No | The paper describes algorithms like Noisy Gradient Descent (NGD) through mathematical update rules (e.g., coupled updates of the form V_{t+1} = V_t − η∇_w F(V_t, X, Y) + noise and W_{t+1} = W_t − η∇_w F(W_t, X, Y) + noise in Section 3.2), but it does not present these or any other procedures in a structured pseudocode or algorithm block. |
| Open Source Code | No | The paper does not provide any explicit statements about the release of source code, nor does it include links to a code repository in the main text or supplementary material. |
| Open Datasets | No | The paper focuses on theoretical analysis and numerical computation of privacy bounds in the context of linear regression. Its 'experiments' involve numerical estimations of theoretical constructs and parameter settings (e.g., 'Let v, w ∈ ℝ^{1×d}.', 'We set σ_θ = 1.'). It does not use or refer to any specific publicly available datasets for empirical evaluation. |
| Dataset Splits | No | The paper does not conduct experiments on empirical datasets; instead, it performs numerical computations based on theoretical models. Therefore, there are no mentions of training/test/validation dataset splits. |
| Hardware Specification | No | The paper describes theoretical analysis and numerical computations, but it does not specify any particular hardware (e.g., CPU, GPU models, or cloud resources) used to perform these computations. |
| Software Dependencies | No | In Appendix B.7, the paper mentions using `scipy.optimize.brentq`, `scipy.stats.norm.cdf`, `scipy.stats.norm.ppf`, `scipy.stats.chi2.cdf`, `scipy.stats.chi2.ppf`, and `np.finfo`. While these indicate the use of Python libraries like SciPy and NumPy, specific version numbers for these components are not provided. |
| Experiment Setup | Yes | In Appendix B.7 ('Methodology of experiments releasing one output' and 'Methodology of experiments releasing multiple outputs'), the paper provides details on the numerical setup: 'We set the number of samples L = 5×10⁵. We draw M = 50 times (X_{k,k'})_{1≤k≤L, 1≤k'≤M} i.i.d. from Unif([0, 1]). Then, we run the procedure and the estimates are averaged: el_α(h) = (1/(M(α−1))) Σ_{k'=1}^{M} |h'(X_{k,k'})|^{1−α}. We repeat the process for different values of d and α. We set σ_θ = 1.' |
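The NGD update quoted in the Pseudocode row can be made concrete. The sketch below is a minimal NumPy implementation of noisy gradient descent on a least-squares loss; the noise scale `sqrt(2 * eta) * sigma` and the toy data are illustrative assumptions, not the paper's exact calibration.

```python
import numpy as np

def noisy_gd(X, Y, eta=0.01, sigma=1.0, steps=100, seed=0):
    """Noisy gradient descent on F(w, X, Y) = 0.5 * ||X w - Y||^2.

    Each step follows w <- w - eta * grad F(w, X, Y) + Gaussian noise,
    mirroring the V_t / W_t update rules quoted from Section 3.2.
    The noise scale sqrt(2 * eta) * sigma is an assumption for
    illustration, not the paper's exact choice.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (X @ w - Y)  # gradient of the squared loss
        w = w - eta * grad + np.sqrt(2 * eta) * sigma * rng.standard_normal(d)
    return w

# Toy usage on synthetic linear-regression data (hypothetical parameters).
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(50)
w_priv = noisy_gd(X, Y, eta=0.005, sigma=0.1, steps=200)
```

With a small noise scale the iterate lands near the least-squares solution; larger `sigma` trades accuracy for stronger (differential-privacy style) noise.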
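The batched Monte Carlo procedure quoted in the Experiment Setup row can likewise be sketched. This is a generic implementation of the averaged estimator shape described there (L uniform draws per batch, M batches, terms of the form |h'(X)|^{1−α}); the function `h_prime` and the exact normalization are placeholders for the paper's construction, not its definition.

```python
import numpy as np

def renyi_moment_estimate(h_prime, alpha, L=500_000, M=50, seed=0):
    """Monte Carlo estimate averaged over M independent batches.

    Each batch draws L samples X ~ Unif([0, 1]) and averages
    |h'(X)|^(1 - alpha); the per-batch estimates are then averaged
    and scaled by 1 / (alpha - 1). Hypothetical stand-in for the
    estimator quoted from Appendix B.7.
    """
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(M):
        X = rng.uniform(0.0, 1.0, size=L)
        estimates.append(np.mean(np.abs(h_prime(X)) ** (1.0 - alpha)))
    return np.mean(estimates) / (alpha - 1.0)

# Toy usage with a hypothetical derivative h'(x) = 2 + x and alpha = 2,
# so each term is 1 / (2 + x); the exact mean is log(3/2) ≈ 0.405.
est = renyi_moment_estimate(lambda x: 2.0 + x, alpha=2.0, L=10_000, M=10)
```

Averaging over M independent batches, as the paper does, reduces the variance of the final estimate at fixed per-batch cost L.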