Simulation-based Bayesian Inference from Privacy Protected Data

Authors: Yifei Xiong, Nianqiao Ju, Sanguo Zhang

TMLR 2025

Reproducibility (each item lists the Variable, the Result, and the supporting LLM Response):
Research Type: Experimental
LLM Response: "We illustrate our methods on discrete time-series data under an infectious disease model and with ordinary linear regression models. Illustrating the privacy-utility trade-off, our experiments and analysis demonstrate the necessity and feasibility of designing valid statistical inference procedures to correct for biases introduced by the privacy-protection mechanisms."
Researcher Affiliation: Academia
LLM Response: "Yifei Xiong (EMAIL), Department of Statistics, Purdue University; Nianqiao Phyllis Ju (EMAIL), Department of Statistics, Purdue University; Sanguo Zhang (EMAIL), School of Mathematical Sciences, University of Chinese Academy of Sciences"
Pseudocode: Yes
LLM Response: "Algorithm 1: Sequential private-data posterior estimation (SPPE) ... Algorithm 2: Sequential private-data likelihood estimation (SPLE) ... Algorithm 3: Sequential Monte Carlo Approximate Bayesian Computation (SMC-ABC)"
Open Source Code: Yes
LLM Response: "The code is available on GitHub: https://github.com/Yifei-Xiong/Simulation-based-Bayesian-Inference-from-Privacy-Protected-Data"
Open Datasets: Yes
LLM Response: "We apply our privacy mechanism and inference methods to several real infectious disease outbreaks: influenza, Ebola, and COVID-19. ... Influenza outbreak: we utilized the dataset from a boarding school, obtained from https://search.r-project.org/CRAN/refmans/epimdr/html/flu.html. ... Ebola outbreak in West Africa, 2014: the dataset source is https://apps.who.int/gho/data/node.ebola-sitrep. ... COVID-19: see https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/state/nevada/county/clark-county/."
Dataset Splits: Yes
LLM Response: "In each round of training, we randomly select 5% of the newly generated samples as validation data."
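The per-round 5% validation holdout quoted above can be sketched in a few lines; this is a minimal, framework-agnostic illustration, and the function name, interface, and seed handling are illustrative assumptions rather than the paper's actual code.

```python
import random

def split_round_samples(samples, val_frac=0.05, seed=None):
    """Randomly hold out a fraction of newly generated samples as validation data."""
    rng = random.Random(seed)
    idx = list(range(len(samples)))
    rng.shuffle(idx)
    n_val = max(1, int(round(val_frac * len(samples))))
    val_idx = set(idx[:n_val])
    train = [s for i, s in enumerate(samples) if i not in val_idx]
    val = [s for i, s in enumerate(samples) if i in val_idx]
    return train, val

# Example matching the paper's setting: N = 1000 simulated samples per round,
# of which 5% (50 samples) become validation data.
train, val = split_round_samples(list(range(1000)), val_frac=0.05, seed=0)
```

In a sequential scheme such as SPPE/SPLE, a split like this would be applied to the fresh simulations of each round rather than to a fixed dataset.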
Hardware Specification: Yes
LLM Response: "Our numerical experiments were conducted on a computer equipped with four GeForce RTX 2080 Ti graphics cards and a pair of 14-core Intel E5-2690 v4 CPUs."
Software Dependencies: No
LLM Response: The paper mentions the "Pytorch package in Python" but does not specify version numbers for either Python or PyTorch, which are necessary for reproducible software dependencies.
Experiment Setup: Yes
LLM Response: "We employed neural spline flows (Durkan et al., 2019) as the conditional density estimator, consisting of 8 layers. ... Each layer consists of two residual blocks with 50 units and ReLU activation function, with 10 bins in each monotonic piecewise rational-quadratic transform, and the tail bound was set to 5. ... In the training process, the number of samples simulated in each round is N = 1000 and there are R = 10 rounds in total. ... We stop training if the loss on validation data does not decrease after 20 epochs in a single round. For the stochastic gradient descent optimizer, we choose Adam (Kingma & Ba, 2014) with a batch size of 100, a learning rate of 5 × 10^-4, and a weight decay of 10^-4."
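The early-stopping rule quoted above (end a training round once the validation loss has not improved for 20 epochs) can be sketched independently of any deep learning framework; the class name and interface below are illustrative assumptions, not the paper's implementation.

```python
class EarlyStopping:
    """Stop a training round once validation loss fails to improve for `patience` epochs."""

    def __init__(self, patience=20):
        self.patience = patience
        self.best = float("inf")
        self.epochs_without_improvement = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience
```

In the setup described, a monitor like this would wrap the epoch loop inside each of the R = 10 rounds, alongside an Adam optimizer configured with a batch size of 100, a learning rate of 5 × 10^-4, and a weight decay of 10^-4.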