Rethinking Aleatoric and Epistemic Uncertainty

Authors: Freddie Bickford Smith, Jannik Kossen, Eleanor Trollope, Mark van der Wilk, Adam Foster, Tom Rainforth

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental
In Figure 4 we demonstrate that the approximation EIG_θ ≈ IG_z(y_{n+1:∞}) from Proposition 5 can be coarse. These results were produced with extremely simple setups in which we can perform exact inference and are sure to recover the true data-generating process in the limit of infinite data (see Appendix C for details). We therefore know that the estimation error is due to a failure of the model to accurately simulate future data, which in turn is due to n being finite. Figure 5: BALD outperforms predictive entropy as a data-acquisition objective in active learning, even though BALD tends to be a worse estimator of long-run predictive information gain in the setups studied. These results were produced using the experimental setups described in Bickford Smith et al. (2023; 2024). Appendix C (implementation details): code to generate Figures 3 and 4 is available at github.com/fbickfordsmith/rethinking-aleatoric-epistemic.
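The Figure 5 comparison above rests on the standard definitions of the two acquisition objectives: predictive entropy is the entropy of the posterior-mean predictive distribution, and BALD is that entropy minus the expected entropy of the per-sample predictives. A minimal Monte Carlo sketch of both scores, assuming posterior samples of class probabilities (the function name and array shapes are illustrative, not from the paper):

```python
import numpy as np

def acquisition_scores(probs, eps=1e-12):
    """probs: array of shape (S, C) holding S posterior samples of
    class probabilities for one candidate input.

    Returns (bald, predictive_entropy), where
      predictive_entropy = H[ mean_theta p(y|x, theta) ]   (total uncertainty)
      bald = predictive_entropy - mean_theta H[ p(y|x, theta) ].
    """
    mean = probs.mean(axis=0)                                   # posterior-mean predictive
    pred_entropy = -np.sum(mean * np.log(mean + eps))           # total uncertainty
    exp_entropy = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))
    return pred_entropy - exp_entropy, pred_entropy
```

When all posterior samples agree, BALD is zero even if predictive entropy is high; BALD is large when the samples are individually confident but disagree with one another, which is why the two objectives can rank candidate points differently.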
Researcher Affiliation: Academia
University of Oxford. Correspondence to Freddie Bickford Smith <EMAIL>.
Pseudocode: No
The paper describes methods and concepts using mathematical notation and prose but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code: Yes
Code to generate Figures 3 and 4 is available at github.com/fbickfordsmith/rethinking-aleatoric-epistemic.
Open Datasets: Yes
Figure 5: BALD outperforms predictive entropy as a data-acquisition objective in active learning... These results were produced using the experimental setups described in Bickford Smith et al. (2023; 2024), which use Curated MNIST and Coarse ImageNet.
Dataset Splits: No
The paper mentions datasets such as Curated MNIST and Coarse ImageNet and describes sampling data for synthetic cases (e.g., "sample four datasets, y_{1:n}, with y_i ~ p_train(y) and n ∈ {1, 10, 100, 1000}"). However, it does not explicitly provide train/test/validation splits (percentages or counts) or reference standard splits within the main text for reproducibility.
Hardware Specification: No
The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments.
Software Dependencies: No
The paper does not specify software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9) required to replicate the experiments.
Experiment Setup: Yes
C.1 (Figure 3): We consider predicting an output, z ∈ R, corresponding to an input, x ∈ R... We use this to compute a Gaussian-process predictive posterior, p_n(z|x) = p(z|x, y_{1:n}), based on a generative model comprising a Gaussian likelihood function, p(z|x, θ) = Normal(z|θ(x), σ²), where σ = 0.1, and a Gaussian-process prior, θ ~ GP(0, k), where k(x, x′) = exp(−(x − x′)²/2). C.2 (Figure 4): In the discrete case we have y ∈ {0, 1}, with data generated from p_train(y) = Bernoulli(y|η = 0.5)... In the continuous case we have y ∈ R, with data generated from p_train(y) = Normal(y|µ = 1, σ² = 1)...
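The C.1 model (Gaussian likelihood with σ = 0.1 and a zero-mean GP prior with squared-exponential kernel k(x, x′) = exp(−(x − x′)²/2)) admits exact inference via the standard GP-regression conditioning formulas, which is what makes the estimation-error diagnosis in the Research Type entry possible. A sketch of that exact predictive posterior, with the function name chosen for illustration:

```python
import numpy as np

def gp_predictive(x_train, y_train, x_test, sigma=0.1):
    """Exact GP predictive posterior p_n(z|x) under the C.1 model:
    likelihood Normal(z | theta(x), sigma^2), prior theta ~ GP(0, k)."""
    def k(a, b):  # squared-exponential kernel, unit lengthscale and variance
        return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

    K = k(x_train, x_train) + sigma**2 * np.eye(len(x_train))  # noisy Gram matrix
    Ks = k(x_test, x_train)
    alpha = np.linalg.solve(K, y_train)
    mean = Ks @ alpha                                          # posterior mean of theta(x)
    cov = k(x_test, x_test) - Ks @ np.linalg.solve(K, Ks.T)    # posterior cov of theta(x)
    # the predictive over the noisy output z adds the observation noise sigma^2
    return mean, cov + sigma**2 * np.eye(len(x_test))
```

Because the posterior is available in closed form here, any gap between an information-gain estimate and its long-run target reflects the finite-n model of future data rather than approximate inference.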