Addressing Misspecification in Simulation-based Inference through Data-driven Calibration

Authors: Antoine Wehenkel, Juan L. Gamella, Ozan Sener, Jens Behrmann, Guillermo Sapiro, Joern-Henrik Jacobsen, Marco Cuturi

ICML 2025

Reproducibility assessment — each item lists the variable, the assessed result, and the supporting excerpt from the paper:
Research Type: Experimental — "Results on four synthetic tasks and two real-world problems with ground-truth labels demonstrate that RoPE outperforms baselines and consistently returns informative and calibrated credible intervals."
Researcher Affiliation: Collaboration — "¹Apple, ²Work done while at Apple, ³ETH Zürich. Correspondence to: Antoine Wehenkel <EMAIL>."
Pseudocode: Yes — "Algorithm 1: Posterior Inference using Robust Neural Posterior Estimation (RoPE)"
Open Source Code: No — "However, we encourage the reader interested in reproducing our experiments to examine our code directly (a link to the code will be made available in the public version of the paper)."
Open Datasets: Yes — "We reproduce the cancer and stromal cell development (CS) and the stochastic epidemic model (SIR) benchmarks from Ward et al. (2022). ... We employ one of the light tunnel datasets from Gamella et al. (2025). ... We employ one of the wind tunnel datasets from Gamella et al. (2025)."
Dataset Splits: Yes — "For all experiments, we compute the LPP and ACAUC on a labeled test set containing 2000 pairs (θ, xo). For all methods trained on the calibration set, we always keep 20% of the calibration set to monitor validation performance, and we select the best model based on this metric. ... Ctrain, Cval = Random Split(C, 1/5)"
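The paper's `Random Split(C, 1/5)` step holds out 20% of the calibration set for validation-based model selection. A minimal sketch of such a split, assuming the calibration set is a list of (θ, xo) pairs (the helper name and signature are illustrative, not the authors' code):

```python
import random

def random_split(calibration_set, val_fraction=0.2, seed=0):
    """Split a calibration set into train/validation subsets.

    Hypothetical helper mirroring the paper's `Random Split(C, 1/5)`:
    20% of the pairs are held out to monitor validation performance.
    """
    rng = random.Random(seed)
    indices = list(range(len(calibration_set)))
    rng.shuffle(indices)
    n_val = int(len(calibration_set) * val_fraction)
    val = [calibration_set[i] for i in indices[:n_val]]
    train = [calibration_set[i] for i in indices[n_val:]]
    return train, val

# Example with dummy (theta, x_o) pairs:
C = [(i, 10 * i) for i in range(10)]
C_train, C_val = random_split(C)
print(len(C_train), len(C_val))  # 8 2
```

Fixing the seed makes the split reproducible across runs, which matters when the validation subset is reused for model selection.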
Hardware Specification: Yes — "In our experiments, solving the OT optimization for 2000 test examples takes less than a minute on an M1 MacBook Pro."
Software Dependencies: No — The paper mentions using the OTT library ("In our experiments, we rely on OTT (Cuturi et al., 2022) to return such a coupling P") but does not provide version numbers for any software dependency.
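The paper delegates the computation of the coupling P to the OTT library. To illustrate what that step computes, here is a minimal pure-Python Sinkhorn iteration for an entropic OT coupling between uniform marginals; this is a sketch only and does not reflect the OTT API or the authors' implementation:

```python
import math

def sinkhorn_coupling(cost, epsilon=0.1, n_iters=200):
    """Entropic-OT coupling via Sinkhorn iterations on a cost matrix.

    Illustrative sketch -- the paper relies on OTT (Cuturi et al., 2022)
    for this step. Returns an (approximately) doubly stochastic coupling
    matrix P between uniform source and target marginals.
    """
    n, m = len(cost), len(cost[0])
    K = [[math.exp(-c / epsilon) for c in row] for row in cost]  # Gibbs kernel
    u, v = [1.0] * n, [1.0] * m
    a = [1.0 / n] * n  # uniform source marginal
    b = [1.0 / m] * m  # uniform target marginal
    for _ in range(n_iters):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Row sums converge to the uniform marginal 1/n:
P = sinkhorn_coupling([[0.0, 1.0], [1.0, 0.0]])
print(round(sum(P[0]), 3))  # 0.5
```

Smaller values of `epsilon` sharpen the coupling toward the unregularized OT plan but require more iterations to converge, which is why dedicated solvers such as OTT are preferred at scale.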
Experiment Setup: Yes — "For training the NPE, we use a batch size of 100 and a learning rate equal to 1e-4. NPE is trained until convergence. Other parameters are set to default values and should marginally impact the NPE obtained. ... We fine-tune the NCDE with a learning rate equal to 1e-5 for 5000 gradient steps on 80% of the full calibration set."