From Risk to Uncertainty: Generating Predictive Uncertainty Measures via Bayesian Estimation

Authors: Nikita Kotelevskii, Vladimir Kondratyev, Martin Takáč, Eric Moulines, Maxim Panov

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate our method on image datasets by evaluating its performance in detecting out-of-distribution and misclassified instances using the AUROC metric. The experimental results confirm that the measures derived from our framework are useful for the considered downstream tasks. [...] We experimentally evaluate different predictive uncertainty quantification measures from the proposed framework in various tasks. Specifically, we consider out-of-distribution detection and misclassification detection; see Section 6.
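The AUROC evaluation quoted above treats uncertainty as a score for separating in-distribution from out-of-distribution (or correctly from incorrectly classified) inputs. A minimal sketch of that computation, using synthetic illustrative scores rather than the paper's actual outputs:

```python
# Sketch of AUROC-based OOD detection evaluation. The score distributions
# below are synthetic placeholders, NOT results from the paper: we assume
# in-distribution inputs get lower uncertainty than OOD inputs.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
id_scores = rng.normal(loc=0.2, scale=0.1, size=1000)   # in-distribution: low uncertainty
ood_scores = rng.normal(loc=0.8, scale=0.1, size=1000)  # out-of-distribution: high uncertainty

# Label OOD as the positive class and rank all inputs by uncertainty.
labels = np.concatenate([np.zeros(1000), np.ones(1000)])
scores = np.concatenate([id_scores, ood_scores])
auroc = roc_auc_score(labels, scores)
print(f"AUROC: {auroc:.3f}")  # near 1.0 when the uncertainty measure separates the groups
```

Misclassification detection follows the same recipe, with "misclassified" replacing "OOD" as the positive class.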
Researcher Affiliation | Academia | Nikita Kotelevskii (1,2), Vladimir Kondratyev (3), Martin Takáč (1), Éric Moulines (3,1), Maxim Panov (1). 1: Department of Machine Learning, MBZUAI, UAE; 2: CAIT, Skoltech, Russia; 3: CMAP, École polytechnique, France.
Pseudocode | No | The paper does not contain any explicit sections or figures labeled as 'Pseudocode' or 'Algorithm'.
Open Source Code | Yes | The source code is publicly available at https://github.com/stat-ml/uncertainty_from_proper_scoring_rules/.
Open Datasets | Yes | As training (in-distribution) datasets, we consider CIFAR10, CIFAR100 (Krizhevsky, 2009), and Tiny ImageNet (Le & Yang, 2015).
Dataset Splits | No | The paper mentions using CIFAR10, CIFAR100, and Tiny ImageNet, as well as their noisy versions (CIFAR10-N, CIFAR100-N) and out-of-distribution variants (CIFAR10C, ImageNet-O, ImageNet-A, ImageNet-R). It specifies that original versions are used for misclassification detection. However, it does not explicitly provide percentages, sample counts, or specific methodology for training/validation/test splits for any of these datasets.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory amounts used for running the experiments.
Software Dependencies | No | The paper mentions using code from repositories like 'https://github.com/kuangliu/pytorch-cifar' and 'https://github.com/weiaicunzai/pytorch-cifar100', and pre-trained models from 'https://github.com/ENSTA-U2IS-AI/torch-uncertainty'. While these imply the use of PyTorch, specific version numbers for PyTorch, Python, or other libraries are not provided.
Experiment Setup | Yes | We used ResNet18 (He et al., 2016) as the architecture (additional details can be found in Appendix H). [...] The training procedure consisted of 200 epochs with a cosine annealing learning rate. For an optimizer, we use SGD with momentum and weight decay. [...] For CIFAR100-based datasets, we used code from this repository: https://github.com/weiaicunzai/pytorch-cifar100. The training procedure consisted of 200 epochs with learning rate decay at particular milestones: [60, 120, 160]. For an optimizer, we use SGD with momentum and weight decay.
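The quoted setup specifies only the schedule shapes: cosine annealing over 200 epochs for the CIFAR10-based runs, and milestone decay at epochs [60, 120, 160] for the CIFAR100-based runs. A minimal sketch of the two schedules, where the base learning rate (0.1) and decay factor (0.2) are typical CIFAR defaults assumed for illustration, not values stated in the paper:

```python
# Sketch of the two learning-rate schedules described in the quoted setup.
# BASE_LR = 0.1 and GAMMA = 0.2 are assumed illustrative defaults; the paper
# quote gives only the schedule shapes, not these hyperparameter values.
import math

EPOCHS = 200
BASE_LR = 0.1  # assumed
GAMMA = 0.2    # assumed step-decay factor

def cosine_annealing(epoch, base_lr=BASE_LR, t_max=EPOCHS, eta_min=0.0):
    """Cosine annealing (CIFAR10-based runs): decays from base_lr to eta_min."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))

def milestone_decay(epoch, base_lr=BASE_LR, milestones=(60, 120, 160), gamma=GAMMA):
    """Milestone decay (CIFAR100-based runs): multiply by gamma at each passed milestone."""
    return base_lr * gamma ** sum(epoch >= m for m in milestones)

# Sample the schedules at a few epochs to see their shapes.
for epoch in (0, 60, 120, 160, 199):
    print(epoch, cosine_annealing(epoch), milestone_decay(epoch))
```

Both shapes correspond to standard PyTorch schedulers (`CosineAnnealingLR` and `MultiStepLR`), which the cited training repositories use.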