Testing Whether a Learning Procedure is Calibrated

Authors: Jon Cockayne, Matthew M. Graham, Chris J. Oates, T. J. Sullivan, Onur Teymur

JMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | A hypothesis-testing framework is developed in order to assess, using simulation, whether a learning procedure is calibrated. Several vignettes are presented to illustrate different aspects of the framework. ... In Section 3.1, Figure 3 presents 'KS test statistics (left) and p-values (right) for strong and weak calibration of Laplace approximations in the t-distribution example'. In Section 3.2, 'Figure 4 presents empirical CDFs for both ABC and noisy ABC', and 'Figure 5 presents the KS test statistics and corresponding p-values'. In Section 3.3, 'Figure 6: Calibration of probabilistic ODE solvers: Samples (blue) from the strong calibration test statistic F_{f#μ(μ_0, y_i)}(f(θ_i))' shows results of simulations.
Researcher Affiliation | Academia | Jon Cockayne EMAIL Mathematical Sciences, University of Southampton, Highfield, Southampton, SO17 1BJ, UK; Matthew M. Graham EMAIL Centre for Advanced Research Computing, University College London, Gower Street, London, WC1E 6BT, UK; Chris J. Oates EMAIL School of Mathematics, Statistics & Physics, Newcastle University, Newcastle upon Tyne, NE1 7RU, UK; T. J. Sullivan EMAIL Mathematics Institute and School of Engineering, University of Warwick, Coventry, CV4 7AL, UK; Onur Teymur EMAIL School of Mathematics, Statistics & Actuarial Science, University of Kent, Canterbury, CT2 7NZ, UK. All listed affiliations are universities and all email domains are academic (.ac.uk, .uk).
Pseudocode | No | The paper presents mathematical definitions, lemmas, theorems, and discussion of methods. There are no explicitly labelled pseudocode blocks or algorithms in a structured, code-like format.
Open Source Code | No | The paper mentions external code for the probabilistic ODE solvers tested in Section B: 'The code for Chkrebtii et al. (2016) was taken from git.io/J33l L', 'The code for Teymur et al. (2018) was provided to us by the authors and is not yet publicly released.', 'The code for both Schober et al. (2019) and Tronarp et al. (2019) derives from the comprehensive open-source Python package probnum.', and 'The code for Teymur et al. (2021) was provided to us by the authors and expected to be made public on full publication of that paper.'. While these refer to code used or developed by the authors for the methods being *tested*, the paper does not provide open-source code for the 'hypothesis-testing framework' or the 'calibration methodology' it describes.
Open Datasets | No | The paper uses simulated data generated from specified distributions and models, such as a 'Student's t data-generating model' (Section 3.1), the 'g-and-k distribution' (Section 3.2), and a 'Lotka-Volterra ODE' (Section 3.3). These are generative models for simulation rather than explicitly referenced, publicly available datasets with access information.
Dataset Splits | Yes | A strategy to select a suitable test function f is therefore required. Following a generic approach to goodness-of-fit testing, one way to proceed is to split the collection of simulated parameter-dataset pairs into two disjoint sets: S_1 := {(θ_i, y_i)}_{i=1}^{s}, S_2 := {(θ_i, y_i)}_{i=s+1}^{S}. The first subset S_1 can be used to identify a suitable test function f, after which a goodness-of-fit test can be conducted using f and S_2. The independence of S_1 and S_2 ensures that a test conducted in this way is valid. To select a suitable test function, one first identifies a sufficiently small subset F_s ⊆ F_Θ of test functions and, for each f ∈ F_s, performs a univariate goodness-of-fit test using S_1. The element of F_s that gives rise to the strongest evidence against the null hypothesis, based on S_1, is selected.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory used for the experiments. It mentions only general computational considerations such as 'performing full Bayesian inference' and 'computational convenience'.
Software Dependencies | No | The paper mentions 'the comprehensive open-source Python package probnum' but does not specify a version number for it. Other mentions of software refer to specific algorithms or models without version details, such as 'probabilistic iterative method' or 'Gaussian filtering algorithms'.
Experiment Setup | Yes | In Section 3.1, 'Laplace approximations were computed for 10^6 realisations from the hierarchical model ... for each of ν ∈ {1, 2, ..., 20} with N = 5 and for each of N ∈ {1, 2, ..., 20} with ν = 3'. In Section 3.2, for approximate Bayesian computation, 'rejection sampling was used to generate M samples {θ_i^m}_{m=1}^{M} from the distributional output ... for tolerances ϵ ∈ {1, 2, ..., 10}'. In Section 3.3, for the ODE solvers, specific settings are provided: 'the step-size was set at h = 0.1', 'nsolves = 100, N = 100, nevalpoints = 500, lambda = 0.08 and alpha = 1', 'step-size h = 0.5 and overall scaling parameter α = 0.3', 'algo order, which we set to 3', and 'h ∈ {0.1, 0.2, 0.4}'. These are concrete experimental configuration details.
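The KS tests referenced in the Research Type row rest on the probability integral transform: under strong calibration, F_{μ(y_i)}(θ_i) is uniform on [0, 1], which a one-sample KS test can check. A minimal sketch of that idea, using a toy conjugate Gaussian model of our own devising (not the paper's code; the model and all variable names are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm, kstest

rng = np.random.default_rng(0)

# Simulate S parameter-dataset pairs from a conjugate Gaussian model:
# theta ~ N(0, 1), y | theta ~ N(theta, sigma^2).
S, sigma = 1000, 0.5
theta = rng.normal(0.0, 1.0, size=S)
y = rng.normal(theta, sigma)

# Exact Gaussian posterior for each y_i, so this "learning procedure"
# is strongly calibrated by construction.
post_var = 1.0 / (1.0 + 1.0 / sigma**2)
post_mean = post_var * y / sigma**2

# Probability integral transform: under strong calibration these
# values are i.i.d. Uniform(0, 1).
pit = norm.cdf(theta, loc=post_mean, scale=np.sqrt(post_var))

# One-sample KS test against the uniform distribution on [0, 1].
stat, pvalue = kstest(pit, "uniform")
print(f"KS statistic = {stat:.3f}, p-value = {pvalue:.3f}")
```

A miscalibrated procedure (e.g. one reporting an over-confident posterior variance) would instead push the PIT values toward the centre of [0, 1] and drive the KS p-value toward zero.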
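The split-then-select procedure quoted in the Dataset Splits row can be sketched as follows. The candidate test functions, the toy Gaussian model, and the Monte-Carlo estimate of the pushforward CDF are all our own illustrative choices, not the paper's implementation:

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(1)

# Toy conjugate model: theta ~ N(0, 1), y | theta ~ N(theta, sigma^2),
# with the exact posterior playing the role of the learning procedure.
S, s, sigma = 2000, 1000, 0.5
theta = rng.normal(0.0, 1.0, size=S)
y = rng.normal(theta, sigma)
post_var = 1.0 / (1.0 + 1.0 / sigma**2)
post_mean = post_var * y / sigma**2

def pit(f, idx, n_mc=500):
    # Monte-Carlo estimate of F_{f#mu(y_i)}(f(theta_i)) for each pair i.
    draws = rng.normal(post_mean[idx], np.sqrt(post_var),
                       size=(n_mc, idx.size))
    return (f(draws) <= f(theta[idx])).mean(axis=0)

# A small candidate subset F_s of test functions (hand-picked here).
candidates = {"identity": lambda t: t,
              "square": lambda t: t**2,
              "exp": lambda t: np.exp(t)}

# Select f on S_1 = first s pairs (strongest evidence against the null,
# i.e. smallest p-value), then run the actual test on the disjoint S_2.
idx1, idx2 = np.arange(s), np.arange(s, S)
chosen = min(candidates,
             key=lambda k: kstest(pit(candidates[k], idx1),
                                  "uniform").pvalue)
final = kstest(pit(candidates[chosen], idx2), "uniform")
print(f"selected f = {chosen}, final p-value = {final.pvalue:.3f}")
```

Because the selection step only ever sees S_1, the p-value computed on S_2 retains its usual interpretation; running selection and testing on the same pairs would invalidate it.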
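Rejection ABC, as used in the Section 3.2 setup quoted in the Experiment Setup row, can be sketched as below. The toy Gaussian model, the mean summary statistic, and the single tolerance value are our own illustrative assumptions (the paper sweeps ϵ over a grid and uses the g-and-k distribution):

```python
import numpy as np

rng = np.random.default_rng(2)

def rejection_abc(y_obs, prior_sample, simulate, distance, eps, M):
    """Draw M approximate posterior samples: accept a prior draw theta
    whenever its simulated dataset lands within eps of the observed one."""
    accepted = []
    while len(accepted) < M:
        theta = prior_sample()
        if distance(simulate(theta), y_obs) <= eps:
            accepted.append(theta)
    return np.array(accepted)

# Toy example: N(theta, 1) data, N(0, 1) prior, sample mean as summary.
y_obs = rng.normal(1.0, 1.0, size=20)
samples = rejection_abc(
    y_obs,
    prior_sample=lambda: rng.normal(),
    simulate=lambda t: rng.normal(t, 1.0, size=20),
    distance=lambda a, b: abs(a.mean() - b.mean()),
    eps=0.1,
    M=100,
)
print(f"approximate posterior mean: {samples.mean():.3f}")
```

Shrinking eps makes the accepted samples approach the exact posterior at the cost of a lower acceptance rate, which is why the paper's calibration tests are run across a grid of tolerances.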