Testing Whether a Learning Procedure is Calibrated

Authors: Jon Cockayne, Matthew M. Graham, Chris J. Oates, T. J. Sullivan, Onur Teymur

JMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | A hypothesis-testing framework is developed in order to assess, using simulation, whether a learning procedure is calibrated. Several vignettes are presented to illustrate different aspects of the framework. ... In Section 3.1, Figure 3 presents 'KS test statistics (left) and p-values (right) for strong and weak calibration of Laplace approximations in the t-distribution example'. In Section 3.2, 'Figure 4 presents empirical CDFs for both ABC and noisy ABC', and 'Figure 5 presents the KS test statistics and corresponding p-values'. In Section 3.3, 'Figure 6: Calibration of probabilistic ODE solvers: Samples (blue) from the strong calibration test statistic F_{f#μ(μ_0, y_i)}(f(θ_i))' shows results of simulations.
Researcher Affiliation | Academia | Jon Cockayne EMAIL Mathematical Sciences, University of Southampton, Highfield, Southampton, SO17 1BJ, UK; Matthew M. Graham EMAIL Centre for Advanced Research Computing, University College London, Gower Street, London, WC1E 6BT, UK; Chris J. Oates EMAIL School of Mathematics, Statistics & Physics, Newcastle University, Newcastle upon Tyne, NE1 7RU, UK; T. J. Sullivan EMAIL Mathematics Institute and School of Engineering, University of Warwick, Coventry, CV4 7AL, UK; Onur Teymur EMAIL School of Mathematics, Statistics & Actuarial Science, University of Kent, Canterbury, CT2 7NZ, UK. All listed affiliations are universities and all email domains are academic (.ac.uk, .uk).
Pseudocode | No | The paper presents mathematical definitions, lemmas, theorems, and discussion of methods. There are no explicitly labelled pseudocode blocks or algorithms in a structured, code-like format.
Open Source Code | No | The paper mentions external code for the probabilistic ODE solvers tested in Section B: 'The code for Chkrebtii et al. (2016) was taken from git.io/J33l L', 'The code for Teymur et al. (2018) was provided to us by the authors and is not yet publicly released.', 'The code for both Schober et al. (2019) and Tronarp et al. (2019) derives from the comprehensive open-source Python package probnum.', and 'The code for Teymur et al. (2021) was provided to us by the authors and expected to be made public on full publication of that paper.'. While these refer to code used or developed by the authors for the methods being *tested*, the paper does not provide open-source code for the 'hypothesis-testing framework' or the 'calibration methodology' it describes.
Open Datasets | No | The paper uses simulated data generated from specified distributions and models, such as a 'Student's t data-generating model' (Section 3.1), the 'g-and-k distribution' (Section 3.2), and a 'Lotka-Volterra ODE' (Section 3.3). These are generative models for simulation rather than explicitly referenced, publicly available datasets with access information.
Dataset Splits | Yes | A strategy to select a suitable test function f is therefore required. Following a generic approach to goodness-of-fit testing, one way to proceed is to split the collection of simulated parameter-dataset pairs into two disjoint sets: S_1 := {(θ_i, y_i)}_{i=1}^{s}, S_2 := {(θ_i, y_i)}_{i=s+1}^{S}. The first subset S_1 can be used to identify a suitable test function f, after which a goodness-of-fit test can be conducted using f and S_2. The independence of S_1 and S_2 ensures that a test conducted in this way is valid. To select a suitable test function, one first identifies a sufficiently small subset F_s ⊆ F_Θ of test functions and, for each f ∈ F_s, performs a univariate goodness-of-fit test using S_1. The element of F_s that gives rise to the strongest evidence against the null hypothesis, based on S_1, is selected.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory used for the experiments. It mentions only general computational considerations such as 'performing full Bayesian inference' and 'computational convenience'.
Software Dependencies | No | The paper mentions 'the comprehensive open-source Python package probnum' but does not specify a version number for it. Other mentions of software refer to specific algorithms or models without version details, such as 'probabilistic iterative method' or 'Gaussian filtering algorithms'.
Experiment Setup | Yes | In Section 3.1, 'Laplace approximations were computed for 10^6 realisations from the hierarchical model ... for each of ν ∈ {1, 2, ..., 20} with N = 5 and for each of N ∈ {1, 2, ..., 20} with ν = 3'. In Section 3.2, for approximate Bayesian computation, 'rejection sampling was used to generate M samples {θ_i^m}_{m=1}^{M} from the distributional output ... for tolerances ϵ ∈ {1, 2, ..., 10}'. In Section 3.3, for the ODE solvers, specific settings are provided: 'the step-size was set at h = 0.1', 'nsolves = 100, N = 100, nevalpoints = 500, lambda = 0.08 and alpha = 1', 'step-size h = 0.5 and overall scaling parameter α = 0.3', 'algo order, which we set to 3', and 'h ∈ {0.1, 0.2, 0.4}'. These are concrete experimental configuration details.
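The KS tests referenced in the Research Type row rest on the probability integral transform: under strong calibration, F_{μ(y_i)}(θ_i) is uniform on [0, 1], which a one-sample KS test can check. A minimal sketch of that idea, using a toy conjugate Gaussian model of our own devising (not the paper's code; the model and all variable names are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm, kstest

rng = np.random.default_rng(0)

# Simulate S parameter-dataset pairs from a conjugate Gaussian model:
# theta ~ N(0, 1), y | theta ~ N(theta, sigma^2).
S, sigma = 1000, 0.5
theta = rng.normal(0.0, 1.0, size=S)
y = rng.normal(theta, sigma)

# Exact Gaussian posterior for each y_i, so this "learning procedure"
# is strongly calibrated by construction.
post_var = 1.0 / (1.0 + 1.0 / sigma**2)
post_mean = post_var * y / sigma**2

# Probability integral transform: under strong calibration these
# values are i.i.d. Uniform(0, 1).
pit = norm.cdf(theta, loc=post_mean, scale=np.sqrt(post_var))

# One-sample KS test against the uniform distribution on [0, 1].
stat, pvalue = kstest(pit, "uniform")
print(f"KS statistic = {stat:.3f}, p-value = {pvalue:.3f}")
```

A miscalibrated procedure (e.g. one reporting an over-confident posterior variance) would instead push the PIT values toward the centre of [0, 1] and drive the KS p-value toward zero.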
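The split-then-select procedure quoted in the Dataset Splits row can be sketched as follows. The candidate test functions, the toy Gaussian model, and the Monte-Carlo estimate of the pushforward CDF are all our own illustrative choices, not the paper's implementation:

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(1)

# Toy conjugate model: theta ~ N(0, 1), y | theta ~ N(theta, sigma^2),
# with the exact posterior playing the role of the learning procedure.
S, s, sigma = 2000, 1000, 0.5
theta = rng.normal(0.0, 1.0, size=S)
y = rng.normal(theta, sigma)
post_var = 1.0 / (1.0 + 1.0 / sigma**2)
post_mean = post_var * y / sigma**2

def pit(f, idx, n_mc=500):
    # Monte-Carlo estimate of F_{f#mu(y_i)}(f(theta_i)) for each pair i.
    draws = rng.normal(post_mean[idx], np.sqrt(post_var),
                       size=(n_mc, idx.size))
    return (f(draws) <= f(theta[idx])).mean(axis=0)

# A small candidate subset F_s of test functions (hand-picked here).
candidates = {"identity": lambda t: t,
              "square": lambda t: t**2,
              "exp": lambda t: np.exp(t)}

# Select f on S_1 = first s pairs (strongest evidence against the null,
# i.e. smallest p-value), then run the actual test on the disjoint S_2.
idx1, idx2 = np.arange(s), np.arange(s, S)
chosen = min(candidates,
             key=lambda k: kstest(pit(candidates[k], idx1),
                                  "uniform").pvalue)
final = kstest(pit(candidates[chosen], idx2), "uniform")
print(f"selected f = {chosen}, final p-value = {final.pvalue:.3f}")
```

Because the selection step only ever sees S_1, the p-value computed on S_2 retains its usual interpretation; running selection and testing on the same pairs would invalidate it.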
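Rejection ABC, as used in the Section 3.2 setup quoted in the Experiment Setup row, can be sketched as below. The toy Gaussian model, the mean summary statistic, and the single tolerance value are our own illustrative assumptions (the paper sweeps ϵ over a grid and uses the g-and-k distribution):

```python
import numpy as np

rng = np.random.default_rng(2)

def rejection_abc(y_obs, prior_sample, simulate, distance, eps, M):
    """Draw M approximate posterior samples: accept a prior draw theta
    whenever its simulated dataset lands within eps of the observed one."""
    accepted = []
    while len(accepted) < M:
        theta = prior_sample()
        if distance(simulate(theta), y_obs) <= eps:
            accepted.append(theta)
    return np.array(accepted)

# Toy example: N(theta, 1) data, N(0, 1) prior, sample mean as summary.
y_obs = rng.normal(1.0, 1.0, size=20)
samples = rejection_abc(
    y_obs,
    prior_sample=lambda: rng.normal(),
    simulate=lambda t: rng.normal(t, 1.0, size=20),
    distance=lambda a, b: abs(a.mean() - b.mean()),
    eps=0.1,
    M=100,
)
print(f"approximate posterior mean: {samples.mean():.3f}")
```

Shrinking eps makes the accepted samples approach the exact posterior at the cost of a lower acceptance rate, which is why the paper's calibration tests are run across a grid of tolerances.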