Does Training with Synthetic Data Truly Protect Privacy?

Authors: Yunpeng Zhao, Jie Zhang

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental To rigorously measure the privacy leakage of empirical methods trained on synthetic data, we use membership inference attacks (Shokri et al., 2017) as a privacy auditing tool. We provide a systematic privacy evaluation on these four training paradigms. For each training paradigm, we interact only with the final model trained on synthetic data, and then determine whether a particular data point was part of the private training dataset. We conduct all experiments on CIFAR-10 (Krizhevsky & Hinton, 2009), as all training methods are scalable to CIFAR-10 and achieve good test accuracy. We report the performance of these methods across three dimensions: privacy leakage (TPR@0.1% FPR), model utility (test accuracy), and efficiency (training time).
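The privacy-leakage metric quoted above, TPR at 0.1% FPR, measures how many true members a membership inference attack can identify while almost never falsely flagging a non-member. As a minimal sketch (not the paper's code; function and variable names are illustrative), given per-sample attack scores where higher means "more likely a member", one can estimate it by thresholding at the appropriate quantile of the non-member scores:

```python
import numpy as np

def tpr_at_fpr(member_scores, nonmember_scores, target_fpr=0.001):
    """Estimate the attack's TPR at a fixed low FPR.

    The threshold is set at the (1 - target_fpr) quantile of the
    non-member scores, so approximately target_fpr of non-members
    are falsely flagged; the TPR is the fraction of true members
    whose score exceeds that threshold.
    """
    thresh = np.quantile(np.asarray(nonmember_scores), 1.0 - target_fpr)
    return float(np.mean(np.asarray(member_scores) > thresh))
```

In practice the scores would come from a likelihood-ratio-style attack evaluated against the shadow models; this sketch only shows how the headline number is read off the score distributions.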
Researcher Affiliation Academia Yunpeng Zhao (National University of Singapore); Jie Zhang (ETH Zurich)
Pseudocode No The paper describes various methods (Coreset Selection, Dataset Distillation, Data-Free Knowledge Distillation, Synthetic Data from Fine-Tuned Diffusion Models) using mathematical formulations and textual descriptions, but it does not present any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code Yes The source code is available at https://github.com/yunpeng-zhao/syndata-privacy.
Open Datasets Yes We conduct all experiments on CIFAR-10 (Krizhevsky & Hinton, 2009)... For example, we use CINIC-10 (Darlow et al., 2018), an extension of CIFAR-10 incorporating downsampled ImageNet images, for initialization.
Dataset Splits Yes We designate 500 random data points as audit samples on which we evaluate membership inference, and we use mislabeled data as strong canaries to simulate worst-case data; the remaining 49,500 samples are always included in every model's training data. For each method, we train 32 shadow models, ensuring that each audit sample is included in the training data of 16 models.
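The shadow-model protocol above puts each audit sample in the training set of exactly half the models (16 of 32), so every sample has balanced "in" and "out" models for the attack. A minimal sketch of such an assignment (illustrative names, not the paper's implementation):

```python
import numpy as np

def assign_audit_samples(n_audit=500, n_models=32, n_in=16, seed=0):
    """Build a boolean inclusion matrix of shape (n_audit, n_models).

    inclusion[i, j] is True iff audit sample i is placed in shadow
    model j's training set; each sample lands in exactly n_in models,
    chosen uniformly at random.
    """
    rng = np.random.default_rng(seed)
    inclusion = np.zeros((n_audit, n_models), dtype=bool)
    for i in range(n_audit):
        inclusion[i, rng.permutation(n_models)[:n_in]] = True
    return inclusion
```

Each shadow model j is then trained on the 49,500 always-included samples plus the audit samples with `inclusion[:, j]` set, and the attack compares a sample's score under its "in" versus "out" models.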
Hardware Specification No The paper mentions 'TESLA (Cui et al., 2023)' and 'TESLA version' in the context of memory for specific methods but does not explicitly state the specific GPU or CPU models or other hardware used for running their experiments. It primarily focuses on software-level details and training protocols rather than hardware specifications.
Software Dependencies No The paper describes various training procedures, optimizers (SGD), and network architectures (ResNet-18, ConvNet) but does not provide specific version numbers for software libraries, frameworks, or programming languages used.
Experiment Setup Yes For the undefended baseline, we employ the same training procedure as described in (Aerni et al., 2024). Concretely, ResNet-18 models are trained using the SGD optimizer with a momentum of 0.9 and a weight decay of 0.0005. We use a batch size of 256 and typical data augmentation techniques, including random horizontal flips and random shifts of up to 4 pixels. The models are optimized over 200 epochs with a base learning rate of 0.1. We employ a linear warm-up of the learning rate during the first epoch, followed by a decay of the learning rate by a factor of 0.2 at epochs 60, 120, and 160. For each method, we train 32 shadow models... For all defenses, we consistently adopt ResNet-18 (He et al., 2016) as the network architecture of shadow models.