Combining Priors with Experience: Confidence Calibration Based on Binomial Process Modeling

Authors: Jinzong Dong, Zhaohui Jiang, Dong Pan, Haoyang Yu

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental The effectiveness of our calibration method and metric is verified on real-world and simulated data. We believe our exploration of integrating prior distributions with empirical data will guide the development of better-calibrated models, contributing to trustworthy AI.
Researcher Affiliation Academia Jinzong Dong, Zhaohui Jiang*, Dong Pan, Haoyang Yu, School of Automation, Central South University
Pseudocode Yes Algorithm 1: Estimating calibration curve. Algorithm 2: Estimating TCE. Algorithm 3: Simulating dataset with binomial process.
Open Source Code Yes Code: https://github.com/NeuroDong/TCEbpm. Extended version: https://arxiv.org/abs/2412.10658
Open Datasets Yes In Appendix B.1, ten publicly available logit datasets (i.e., real-world datasets) and five true distributions (named D1, D2, ..., and D5, respectively) were selected for the experiments. The ResNet110 logits dataset of CIFAR-10, the WideResNet32 logits dataset of CIFAR-100, and the DenseNet162 logits dataset of ImageNet are shown in Appendix B.5.
Dataset Splits Yes For the real-world dataset, we selected ten publicly available logit datasets from a previous study (Roelofs et al. 2022)... These datasets consist of logit outputs and true labels from different network architectures (e.g., ResNet110, WideResNet32, DenseNet162) trained on various image classification benchmarks (e.g., CIFAR-10, CIFAR-100, ImageNet).
Hardware Specification No The paper does not provide specific hardware details used for running its experiments.
Software Dependencies No The paper does not provide specific ancillary software details with version numbers.
Experiment Setup Yes Implementation Details. For Algorithm 1, the optimizer is Adam with a learning rate of 0.01 for 1000 iterations. We set the number of binning schemes B to 100 (where each binning scheme has 20 bins) and the number of bins M to 20 for Histogram Binning, Beta Calibration, Dirichlet Calibration, and Spline Calibration. In the experiments on simulated data, the value of p in Eq. 2 is set to 1. The sampling process in Algorithm 3 requires sampling from Beta(a1, a2) and Binomial(1, P(H=1|S_hat)). We used the default random number generator in Python.
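The sampling procedure described above (the paper's Algorithm 3) can be sketched as follows. This is a minimal illustration, not the authors' released code: confidences are drawn from a Beta(a1, a2) prior and binary correctness labels from a Binomial(1, P(H=1|s)) trial. The paper's five true distributions (D1–D5) are not reproduced here, so `curve` below is a hypothetical monotone calibration curve standing in for them, and the parameter values are illustrative.

```python
import numpy as np

def simulate_binomial_process_dataset(n, a1, a2, true_curve, seed=0):
    """Sketch of simulating a dataset with a binomial process:
    draw n confidences from Beta(a1, a2), then draw each sample's
    correctness label from Binomial(1, true_curve(confidence))."""
    rng = np.random.default_rng(seed)
    s = rng.beta(a1, a2, size=n)   # predicted confidences in (0, 1)
    p = true_curve(s)              # true accuracy at each confidence level
    h = rng.binomial(1, p)         # correctness labels (1 = correct)
    return s, h

# Hypothetical true calibration curve (an assumption, not from the paper):
curve = lambda s: np.clip(s ** 1.5, 0.0, 1.0)
s, h = simulate_binomial_process_dataset(10_000, a1=2.0, a2=1.0, true_curve=curve)
```

Given such a simulated pair `(s, h)`, the true calibration error is known exactly from `curve`, which is what makes simulated data useful for validating a calibration metric.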