Robust Simulation-Based Inference under Missing Data via Neural Processes

Authors: Yogesh Verma, Ayush Bharti, Vikas Garg

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive empirical results on SBI benchmarks show that our approach provides robust inference outcomes compared to standard baselines for varying levels of missing data. Moreover, we demonstrate the merits of our imputation model on two real-world bioactivity datasets (Adrenergic and Kinase assays). Code is available at https://github.com/Aalto-QuML/RISE.
Researcher Affiliation | Collaboration | Yogesh Verma, Ayush Bharti (Department of Computer Science, Aalto University); Vikas Garg (YaiYai Ltd and Aalto University)
Pseudocode | Yes |
Algorithm 1 RISE (training)
Require: Simulator p(x | θ), prior p(θ), iterations n_iter, missingness degree ε
1: Initialize parameters ϕ, φ of RISE
2: for k = 1, . . . , n_iter do
3: Sample (x, θ) ∼ p(x | θ)p(θ)
4: Create mask s w.r.t. ε and MCAR/MAR/MNAR
5: Compute ℓ_RISE using Equation (6)
6: ϕ, φ ← optimize(ℓ_RISE; ϕ, φ)
7: end for
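The loop in Algorithm 1 can be sketched in plain Python. Everything below is illustrative: the toy Gaussian simulator, the prior, and the squared-error stand-in for ℓ_RISE are assumptions, not the paper's model; only the loop structure (sample, mask, score, update) follows the pseudocode.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulator(theta):
    # Toy stand-in for p(x | theta): Gaussian noise around theta.
    return theta + rng.normal(size=theta.shape)

def sample_prior(dim=5):
    # Toy stand-in for p(theta).
    return rng.normal(size=dim)

def mcar_mask(x, eps):
    # MCAR: each entry is observed independently with prob 1 - eps.
    return rng.random(x.shape) >= eps

def train_rise_sketch(n_iter=100, eps=0.25, dim=5):
    losses = []
    for _ in range(n_iter):
        theta = sample_prior(dim)          # step 3: sample (x, theta)
        x = simulator(theta)
        s = mcar_mask(x, eps)              # step 4: mask w.r.t. eps (MCAR)
        # Step 5 placeholder: the real l_RISE (Eq. 6) couples the
        # imputation and inference networks; a squared error on the
        # observed entries keeps this sketch self-contained.
        loss = float(np.mean((x[s] - theta[s]) ** 2)) if s.any() else 0.0
        losses.append(loss)                # step 6 would update phi, varphi
    return losses

losses = train_rise_sketch()
print(len(losses))  # 100
```

In the actual method, steps 5–6 backpropagate through neural-process parameters ϕ, φ rather than computing a closed-form error.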
Open Source Code | Yes | Code is available at https://github.com/Aalto-QuML/RISE.
Open Datasets | Yes | Moreover, we demonstrate the merits of our imputation model on two real-world bioactivity datasets (Adrenergic and Kinase assays). [...] The task is to predict and impute bioactivity data on Adrenergic receptor assays (Whitehead et al., 2019) and Kinase assays (Martin et al., 2017) from the field of drug discovery.
Dataset Splits | No | The paper describes how missingness is introduced in the datasets (e.g., "We take ε ∈ {10%, 25%, 60%} to test performance from low to high missingness scenarios"). However, it does not explicitly provide train/validation/test splits for the benchmark datasets or the real-world bioactivity datasets used in the experiments. It mentions simulating data under a simulation budget, which concerns data generation rather than splitting existing datasets for evaluation.
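The quoted missingness levels correspond to a simple MCAR masking procedure, which can be sketched as follows (the dataset and mask mechanism here are illustrative stand-ins, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 8))  # stand-in dataset: 200 rows, 8 features

for eps in (0.10, 0.25, 0.60):
    # MCAR: each entry is dropped independently with probability eps.
    drop = rng.random(X.shape) < eps
    X_miss = np.where(drop, np.nan, X)
    frac = np.isnan(X_miss).mean()
    print(f"eps={eps:.2f}: empirical missing fraction = {frac:.3f}")
```

MAR and MNAR variants would instead condition the drop probability on observed (MAR) or on the missing values themselves (MNAR).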
Hardware Specification | Yes | Table 7 reports the time (in seconds) per epoch to train different models on a single V100 GPU.
Software Dependencies | No | RISE is implemented in PyTorch (Paszke et al., 2019) and uses the same training configuration as the competing baselines (see Appendix A.4.4 for details). Our inference model implementations are based on publicly available code from the sbi library: https://github.com/mackelab/sbi. While PyTorch and the sbi library are mentioned, specific version numbers for these software dependencies are not provided in the paper.
Experiment Setup | Yes | Throughout our experiments, we maintained a consistent batch size of 50 and a fixed learning rate of 5 × 10⁻⁴. We set a simulation budget of n = 1000 for all the SBI experiments, and take 1000 samples from the posterior distributions to compute the MMD, C2ST and NLPP.
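The reported hyperparameters can be collected into a single configuration sketch (the key names are hypothetical, not taken from the RISE repository):

```python
# Illustrative config mirroring the reported experimental setup.
config = {
    "batch_size": 50,
    "learning_rate": 5e-4,        # fixed across experiments
    "simulation_budget": 1000,    # n simulated (theta, x) pairs per SBI task
    "posterior_samples": 1000,    # samples used to compute MMD, C2ST, NLPP
}
print(config)
```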