On Consistent Bayesian Inference from Synthetic Data

Authors: Ossi Räisä, Joonas Jälkö, Antti Honkela

JMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We prove that mixing posterior samples obtained separately from multiple large synthetic data sets that are sampled from a posterior predictive converges to the posterior of the downstream analysis under standard regularity conditions when the analyst's model is compatible with the data provider's model. We also present several examples showing how the theory works in practice, and how Bayesian inference can fail when the compatibility assumption is not met or the synthetic data set is not significantly larger than the original. Keywords: synthetic data, Bayesian inference, Bernstein-von Mises theorem, differential privacy
Researcher Affiliation | Academia | Ossi Räisä (EMAIL), Joonas Jälkö (EMAIL), Antti Honkela (EMAIL), Department of Computer Science, University of Helsinki, P.O. Box 68 (Pietari Kalmin katu 5), 00014 University of Helsinki, Finland
Pseudocode | No | The paper describes methodologies and algorithms such as NAPSU-MQ and NUTS, but it does not present any structured pseudocode or algorithm blocks in the main text.
Open Source Code | Yes | "Our code is available under an open-source license." https://github.com/DPBayes/NAPSU-MQ-bayesian-downstream-experiments
Open Datasets | Yes | To test our theory on real data, we used the UCI Adult data set (Kohavi and Becker, 1996) in the setting that was used to test NAPSU-MQ (Räisä et al., 2023).
Dataset Splits | No | The paper mentions using a toy data set of n_X = 2000 samples and the UCI Adult data set with n_X = 46043 data points, and states "We take bootstrap samples of the data to simulate draws from a population." However, it does not provide specific train/test/validation splits by percentage, count, or a reference to a predefined split.
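The bootstrap-as-population device quoted in that row can be sketched as follows. This is an illustrative stand-in, not the paper's actual pipeline: the placeholder data and the number of repeats are our assumptions; only the original sample size n_X = 2000 comes from the paper's toy setting.

```python
import numpy as np

# Simulate repeated draws from a "population" by bootstrap-resampling an
# observed data set, as the quoted passage describes.
rng = np.random.default_rng(0)
original = rng.normal(size=2000)  # placeholder for the observed toy data

n_repeats = 5  # our choice for illustration
bootstrap_samples = [
    rng.choice(original, size=original.size, replace=True)
    for _ in range(n_repeats)
]
# Each bootstrap sample has the same size as the original and is drawn
# with replacement, so it mimics a fresh draw from the population.
```

Resampling with replacement at the original sample size is the standard nonparametric bootstrap; each resample plays the role of an independent data set drawn from the same underlying distribution.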
Hardware Specification | No | "The authors wish to thank the Finnish Computing Competence Infrastructure (FCCI) for supporting this project with computational and data storage resources." This statement refers to general computational resources but does not provide specific hardware details (e.g., GPU/CPU models, memory specifications).
Software Dependencies | No | The paper mentions NUTS (Hoffman and Gelman, 2014), DP-GLM (Kulkarni et al., 2021), synthpop (Nowok et al., 2016), DP-SGD (Rajkumar and Agarwal, 2012; Song et al., 2013; Abadi et al., 2016), specifically DP-Adam, and the Optuna library (Akiba et al., 2019). While various software components are named, specific version numbers are not provided.
Experiment Setup | Yes | For NAPSU-MQ, we use the hyperparameters of Räisä et al. (2023), except that we used NUTS (Hoffman and Gelman, 2014) as the posterior sampling algorithm, with 200 warmup samples and 500 kept samples per chain for ϵ ∈ {0.5, 1}, and 1500 kept samples per chain for ϵ = 0.1. The NAPSU-MQ prior is N(0, 10²I), and the summary is the single 3-way marginal query over all three variables. The hyperparameters of DP-GLM are the L2-norm upper bound R for the covariates of the logistic regression, a coefficient norm upper bound s, and the parameters of the posterior sampling algorithm DP-GLM uses. We set R = 2 so that the covariates do not get clipped, and set s = 5 after some preliminary runs. The posterior sampling algorithm is NUTS (Hoffman and Gelman, 2014) with 1000 warmup samples and 1000 kept samples from 4 parallel chains. The prior for the downstream Bayesian logistic regression is N(0, 10), i.i.d. for each coefficient. The privacy parameters are ϵ ∈ {0.25, 0.5, 1} and δ = n_X⁻² ≈ 4.7 × 10⁻¹⁰. DPVI runs DP-SGD (Rajkumar and Agarwal, 2012; Song et al., 2013; Abadi et al., 2016), specifically DP-Adam, under the hood, so it inherits the clip bound, learning rate, number of iterations, and subsampling (without replacement) ratio hyperparameters from DP-SGD. We tuned these with the Optuna library (Akiba et al., 2019), using the bounds [0.1, 50] for the clip bound, [10⁻⁴, 10⁻¹] for the learning rate, [10⁴, 10⁵] for the number of iterations, and [0.001, 1] for the subsampling ratio.
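Two details of the quoted setup can be checked numerically: the privacy parameter δ = n_X⁻² for the UCI Adult setting (n_X = 46043), and the DPVI hyperparameter search space tuned with Optuna. The sketch below is ours, not code from the authors' repository; the bounds are as quoted, but the dictionary encoding and the log-uniform sampling helper are illustrative assumptions (log scale is the usual choice for learning-rate-like parameters).

```python
import math

# Privacy parameter from the quoted setup: delta = n_X^(-2).
n_X = 46043
delta = n_X ** -2
print(f"delta = {delta:.2e}")  # approximately 4.7e-10, as reported

# DPVI / DP-Adam search space tuned with Optuna in the paper.
# Encoding: name -> (low, high, use_log_scale); the log-scale flags are ours.
search_space = {
    "clip_bound":        (0.1, 50.0, False),
    "learning_rate":     (1e-4, 1e-1, True),
    "num_iterations":    (1e4, 1e5, True),
    "subsampling_ratio": (0.001, 1.0, False),
}

def sample(low, high, log, u):
    """Map u in [0, 1] to the bound range, uniformly or log-uniformly."""
    if log:
        return math.exp(math.log(low) + u * (math.log(high) - math.log(low)))
    return low + u * (high - low)
```

With a real tuner such as Optuna, each entry would correspond to one `suggest_float` call per trial; the helper above just makes the uniform-vs-log-uniform distinction explicit.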