Privacy-Preserving Energy-Based Generative Models for Marginal Distribution Protection
Authors: Robert E. Tillman, Tucker Balch, Manuela Veloso
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate this approach using financial and healthcare datasets and demonstrate that the resulting learnt generative models produce high fidelity synthetic data while preserving privacy. We also show that PPEMs can incorporate both α-LMDP and DP in contexts where both forms of privacy are required. ... Using credit card data and electronic healthcare records, we empirically demonstrate that PPEMs produce high fidelity synthetic data while preserving privacy. |
| Researcher Affiliation | Industry | Robert E. Tillman, Optum AI Labs (United Health Group); Tucker Balch, J.P. Morgan Chase AI Research; Manuela Veloso, J.P. Morgan Chase AI Research |
| Pseudocode | Yes | Pseudocode for training and sampling is provided in Appendix A and proofs are provided in Appendix B. Code is also provided in the attached supplement. |
| Open Source Code | Yes | Pseudocode for training and sampling is provided in Appendix A and proofs are provided in Appendix B. Code is also provided in the attached supplement. |
| Open Datasets | Yes | We next apply PPEMs to real financial and healthcare datasets that have previously been used to benchmark privacy-preserving generative models: the Kaggle credit card fraud dataset (Pozzolo et al., 2015), used as the primary evaluation dataset for PATE-GAN, consists of 28 factors used to predict whether a transaction is fraudulent and the transaction amount; the MIMIC-III critical care electronic healthcare record (EHR) dataset (Johnson et al., 2016) consists of binary indicators for diagnoses patients received. ... The license for this dataset is available at https://opendatacommons.org/licenses/dbcl/1-0/. ... The license for this dataset is available at https://physionet.org/content/mimiciii/view-license/1.4/. |
| Dataset Splits | No | The paper does not explicitly provide information on dataset splits (e.g., training, validation, test percentages or counts). It only mentions using a minibatch size of 128 for training. |
| Hardware Specification | Yes | All experiments were run using a single NVIDIA T4 GPU. |
| Software Dependencies | No | The paper mentions using a public implementation for DP-GAN and PATE-GAN, but does not list specific software dependencies (e.g., Python, PyTorch, CUDA versions) used for their own PPEM models. |
| Experiment Setup | Yes | D.1 Hyperparameters. Below are the hyperparameters used in all experiments with PPEM models: m = 10; λα = 10; λD = 1; training epochs = 100; minibatch size = 128; energy model iterations per generator iteration = 5; MLP layer dimensions (all networks) = 256; latent dimensions (both generators) = 128; generator learning rate = 5e-4; energy model learning rate = 1e-3; α-level = 0.05; (ϵ, δ) = (1, n⁻¹) |
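The reported hyperparameters can be collected into a single configuration sketch for anyone attempting a re-run. The dictionary keys and the `delta_for` helper below are illustrative names, not identifiers from the authors' supplement; only the values come from Appendix D.1 of the paper.

```python
# Hyperparameters reported in Appendix D.1 for all PPEM experiments.
# Key names are assumptions for illustration; values are from the paper.
ppem_config = {
    "m": 10,                            # m = 10 (paper's notation)
    "lambda_alpha": 10,                 # λα = 10
    "lambda_D": 1,                      # λD = 1
    "epochs": 100,                      # training epochs
    "minibatch_size": 128,
    "energy_iters_per_gen_iter": 5,     # energy model iterations per generator iteration
    "mlp_layer_dim": 256,               # all networks
    "latent_dim": 128,                  # both generators
    "generator_lr": 5e-4,
    "energy_model_lr": 1e-3,
    "alpha_level": 0.05,
    "dp_epsilon": 1,                    # (ϵ, δ) = (1, n⁻¹)
}

def delta_for(n: int) -> float:
    """DP δ reported as n⁻¹, where n is the training-set size."""
    return 1.0 / n
```

A re-implementation would still need the unreported details noted above (e.g., dataset splits and exact software versions), so this config alone does not fully determine the experimental setup.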