On Consistent Bayesian Inference from Synthetic Data
Authors: Ossi Räisä, Joonas Jälkö, Antti Honkela
JMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We prove that mixing posterior samples obtained separately from multiple large synthetic data sets that are sampled from a posterior predictive converges to the posterior of the downstream analysis under standard regularity conditions, when the analyst’s model is compatible with the data provider’s model. We also present several examples showing how the theory works in practice, and showing how Bayesian inference can fail when the compatibility assumption is not met, or the synthetic data set is not significantly larger than the original. Keywords: synthetic data, Bayesian inference, Bernstein-von Mises theorem, differential privacy |
| Researcher Affiliation | Academia | Ossi Räisä (EMAIL), Joonas Jälkö (EMAIL), Antti Honkela (EMAIL); Department of Computer Science, University of Helsinki, P.O. Box 68 (Pietari Kalmin katu 5), 00014 University of Helsinki, Finland |
| Pseudocode | No | The paper describes methodologies and algorithms like NAPSU-MQ and NUTS, but it does not present any structured pseudocode or algorithm blocks in the main text. |
| Open Source Code | Yes | Our code is available under an open-source license.1 1. https://github.com/DPBayes/NAPSU-MQ-bayesian-downstream-experiments |
| Open Datasets | Yes | To test our theory on real data, we used the UCI Adult data set (Kohavi and Becker, 1996) setting that was used to test NAPSU-MQ (Räisä et al., 2023). |
| Dataset Splits | No | The paper mentions using a toy data set of n_X = 2000 samples and the UCI Adult dataset with n_X = 46043 datapoints, and states 'We take bootstrap samples of the data to simulate draws from a population.' However, it does not provide specific train/test/validation splits by percentage, count, or a reference to a predefined split. |
| Hardware Specification | No | The authors wish to thank the Finnish Computing Competence Infrastructure (FCCI) for supporting this project with computational and data storage resources. This statement refers to general computational resources but does not provide specific hardware details (e.g., GPU/CPU models, memory specifications). |
| Software Dependencies | No | The paper mentions using 'NUTS (Hoffman and Gelman, 2014)', 'DP-GLM from Kulkarni et al. (2021)', 'synthpop (Nowok et al., 2016)', 'DP-SGD (Rajkumar and Agarwal, 2012; Song et al., 2013; Abadi et al., 2016), specifically DP-Adam', and 'Optuna library (Akiba et al., 2019)'. While various software components are named, specific version numbers for these are not provided. |
| Experiment Setup | Yes | For NAPSU-MQ, we use the hyperparameters of Räisä et al. (2023), except we used NUTS (Hoffman and Gelman, 2014) as the posterior sampling algorithm, with 200 warmup samples and 500 kept samples per chain for ϵ ∈ {0.5, 1}, and 1500 kept samples per chain for ϵ = 0.1. The NAPSU-MQ prior is N(0, 10²I), and the summary is the single 3-way marginal query over all three variables. The hyperparameters of DP-GLM are the L2-norm upper bound R for the covariates of the logistic regression, a coefficient norm upper bound s, and the parameters of the posterior sampling algorithm DP-GLM uses. We set R = 2 so that the covariates do not get clipped, and set s = 5 after some preliminary runs. The posterior sampling algorithm is NUTS (Hoffman and Gelman, 2014) with 1000 warmup samples and 1000 kept samples from 4 parallel chains. The prior for the downstream Bayesian logistic regression is N(0, 10), i.i.d. for each coefficient. The privacy parameters are ϵ ∈ {0.25, 0.5, 1} and δ = n_X⁻² ≈ 4.7 × 10⁻¹⁰. DPVI runs DP-SGD (Rajkumar and Agarwal, 2012; Song et al., 2013; Abadi et al., 2016), specifically DP-Adam, under the hood, so it inherits the clip bound, learning rate, number of iterations, and subsampling (without replacement) ratio hyperparameters from DP-SGD. We tuned these with the Optuna library (Akiba et al., 2019), using the bounds [0.1, 50] for the clip bound, [10⁻⁴, 10⁻¹] for the learning rate, [10⁴, 10⁵] for the number of iterations, and [0.001, 1] for the subsampling ratio. |
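The paper's central claim, quoted in the Research Type row, is that mixing posterior samples drawn separately from multiple large posterior-predictive synthetic data sets converges to the original-data posterior when the analyst's model is compatible with the data provider's. A minimal pure-Python sketch of this pipeline, using a conjugate normal-mean model chosen here for illustration (the model, sample sizes, and variable names are our assumptions, not the paper's experiments):

```python
import random
import statistics

random.seed(0)

# Data provider: observes original data x_i ~ N(theta, 1) with prior theta ~ N(0, 100)
n_orig, true_theta = 2000, 1.5
x = [random.gauss(true_theta, 1) for _ in range(n_orig)]

def normal_posterior(data, prior_var=100.0):
    """Conjugate posterior for the mean of N(theta, 1) under a N(0, prior_var) prior."""
    n = len(data)
    post_var = 1.0 / (1.0 / prior_var + n)
    post_mean = post_var * sum(data)
    return post_mean, post_var

mu_n, var_n = normal_posterior(x)  # original-data posterior N(mu_n, var_n)

# Provider releases m synthetic data sets from the posterior predictive,
# each substantially larger than the original (as the theory requires).
m, n_syn = 20, 20000
mixed_samples = []
for _ in range(m):
    theta_star = random.gauss(mu_n, var_n ** 0.5)           # draw theta* ~ posterior
    syn = [random.gauss(theta_star, 1) for _ in range(n_syn)]  # synthetic data | theta*
    # Analyst (compatible model) runs the same Bayesian analysis on each synthetic set
    mu_s, var_s = normal_posterior(syn)
    mixed_samples += [random.gauss(mu_s, var_s ** 0.5) for _ in range(500)]

# Mixing the per-data-set samples approximates the original-data posterior
print(statistics.mean(mixed_samples), mu_n)
```

Under the compatibility assumption the mixed samples concentrate around the original-data posterior mean; shrinking `n_syn` toward `n_orig` inflates the mixture's spread, illustrating the paper's caveat that the synthetic data set must be significantly larger than the original.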