Combining Priors with Experience: Confidence Calibration Based on Binomial Process Modeling

Authors: Jinzong Dong, Zhaohui Jiang, Dong Pan, Haoyang Yu

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental The effectiveness of our calibration method and metric is verified on real-world and simulated data. We believe our exploration of integrating prior distributions with empirical data will guide the development of better-calibrated models, contributing to trustworthy AI.
Researcher Affiliation Academia Jinzong Dong, Zhaohui Jiang*, Dong Pan, Haoyang Yu, School of Automation, Central South University
Pseudocode Yes Algorithm 1: Estimating calibration curve. Algorithm 2: Estimating TCE. Algorithm 3: Simulating dataset with binomial process.
Open Source Code Yes Code: https://github.com/NeuroDong/TCEbpm. Extended version: https://arxiv.org/abs/2412.10658
Open Datasets Yes In Appendix B.1, ten publicly available logit datasets (i.e., real-world datasets) and five true distributions (named D1, D2, ..., and D5, respectively) were selected for the experiments. The ResNet110 logits dataset of CIFAR-10, the WideResNet32 logits dataset of CIFAR-100, and the DenseNet162 logits dataset of ImageNet are shown in Appendix B.5.
Dataset Splits Yes For the real-world dataset, we selected ten publicly available logit datasets from a previous study (Roelofs et al. 2022)... These datasets consist of logit outputs and true labels from different network architectures (e.g., ResNet110, WideResNet32, DenseNet162) trained on various image classification benchmarks (e.g., CIFAR-10, CIFAR-100, ImageNet).
Hardware Specification No The paper does not provide specific hardware details used for running its experiments.
Software Dependencies No The paper does not provide specific ancillary software details with version numbers.
Experiment Setup Yes Implementation Details. For Algorithm 1, the optimizer is Adam with a learning rate of 0.01 for 1000 iterations. We set the number of binning schemes B to 100 (where each binning scheme has 20 bins) and the number of bins M to 20 for Histogram Binning, Beta Calibration, Dirichlet Calibration, and Spline Calibration. In the experiments on simulated data, the value of p in Eq. 2 is set to 1. The sampling process in Algorithm 3 requires sampling from Beta(a1, a2) and Binomial(1, P(H=1|S_hat)). We used the default random number generator in Python.
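The sampling procedure described above (the paper's Algorithm 3) can be sketched as follows. This is a minimal illustration, not the authors' released code: confidences are drawn from a Beta(a1, a2) prior and binary correctness labels from a Binomial(1, P(H=1|s)) trial. The paper's five true distributions (D1–D5) are not reproduced here, so `curve` below is a hypothetical monotone calibration curve standing in for them, and the parameter values are illustrative.

```python
import numpy as np

def simulate_binomial_process_dataset(n, a1, a2, true_curve, seed=0):
    """Sketch of simulating a dataset with a binomial process:
    draw n confidences from Beta(a1, a2), then draw each sample's
    correctness label from Binomial(1, true_curve(confidence))."""
    rng = np.random.default_rng(seed)
    s = rng.beta(a1, a2, size=n)   # predicted confidences in (0, 1)
    p = true_curve(s)              # true accuracy at each confidence level
    h = rng.binomial(1, p)         # correctness labels (1 = correct)
    return s, h

# Hypothetical true calibration curve (an assumption, not from the paper):
curve = lambda s: np.clip(s ** 1.5, 0.0, 1.0)
s, h = simulate_binomial_process_dataset(10_000, a1=2.0, a2=1.0, true_curve=curve)
```

Given such a simulated pair `(s, h)`, the true calibration error is known exactly from `curve`, which is what makes simulated data useful for validating a calibration metric.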