Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Feature Clipping for Uncertainty Calibration

Authors: Linwei Tao, Minjing Dong, Chang Xu

AAAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our extensive experiments on datasets such as CIFAR-10, CIFAR-100, and ImageNet, and models including CNNs and transformers, demonstrate that FC consistently enhances calibration performance.
Researcher Affiliation Academia ¹University of Sydney, ²City University of Hong Kong. EMAIL, EMAIL, EMAIL
Pseudocode No The paper describes the feature clipping method with a mathematical formula: x̂ = max(min(x, c), −c) (Eq. 1), i.e., clamping each feature into [−c, c], but it does not present this or any other procedure in a structured pseudocode or algorithm block.
Open Source Code Yes Code https://github.com/Linwei94/AAAI2025-FC.git
Open Datasets Yes We evaluate our methods on various deep neural networks (DNNs), including ResNet (He et al. 2016), Wide-ResNet (Zagoruyko and Komodakis 2016), DenseNet (Huang et al. 2017), MobileNet (Howard et al. 2017), and ViT (Dosovitskiy et al. 2020), using the CIFAR-10, CIFAR-100 (Krizhevsky, Hinton et al. 2009), and ImageNet-1K (Deng et al. 2009) datasets to assess the effectiveness of feature clipping.
Dataset Splits Yes We use the Expected Calibration Error (ECE) and accuracy as our primary metrics for evaluation. Additionally, we incorporate Adaptive ECE, a variant of ECE, which groups samples into bins of equal sizes to provide a balanced evaluation of calibration performance. For both ECE and Adaptive ECE, we use a bin size of 15. We also measure the influence of calibration methods on prediction accuracy. (...) The optimal c is determined on the validation set, included in brackets. (...) The total number of samples is 50,000.
Hardware Specification No The paper does not provide specific hardware details such as GPU models, CPU types, or memory amounts used for running the experiments.
Software Dependencies No Pre-trained weights for post hoc calibration evaluation are provided by PyTorch's torchvision. Pre-trained weights trained by other train-time calibration methods are provided by Mukhoti et al. (2020). (This mentions a library but no version number.)
Experiment Setup Yes For all TS-based methods, we determine the temperature by tuning the hyperparameter on the validation set to minimize the Negative Log-Likelihood (NLL). To maintain consistency with TS, we also determine the optimal clipping threshold c on the validation set by minimizing the NLL. (...) For training-time calibration methods, we include training with Brier loss (Brier 1950), label smoothing (Müller et al. 2019) with a smoothing factor of 0.05, FLSD-53 (Mukhoti et al. 2020) using the same γ scheduling scheme as in (Mukhoti et al. 2020), and Dual Focal Loss (Tao et al. 2023b). Detailed settings follow (Mukhoti et al. 2020). All models are trained with the same training recipe, which is included in the Appendix.
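The feature clipping operation quoted in the Pseudocode row, interpreted as a standard clamp of each penultimate-layer feature into [−c, c], can be sketched in a few lines (a minimal NumPy illustration, not the authors' released implementation; the example values are arbitrary):

```python
import numpy as np

def feature_clip(features: np.ndarray, c: float) -> np.ndarray:
    """Clamp each feature value into [-c, c].

    Illustrative stand-in for the paper's Eq. 1; the authors apply
    this to penultimate-layer features before the final classifier.
    """
    return np.clip(features, -c, c)

# Example: features outside [-2, 2] are saturated at the threshold.
out = feature_clip(np.array([-3.2, -0.5, 0.0, 1.7, 4.1]), c=2.0)
print(out)
```

In a PyTorch model the same effect is obtained with `torch.clamp` inserted between the feature extractor and the classification head.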
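The two metrics quoted in the Dataset Splits row — ECE over 15 equal-width confidence bins and Adaptive ECE over 15 equal-size bins — can be sketched as follows (a simplified NumPy illustration on confidence/correctness arrays, not the evaluation code used in the paper):

```python
import numpy as np

def ece(confidences, correct, n_bins=15):
    """Expected Calibration Error: equal-width bins over [0, 1]."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = len(confidences)
    err = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # Weighted gap between accuracy and mean confidence in the bin.
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            err += mask.sum() / total * gap
    return err

def adaptive_ece(confidences, correct, n_bins=15):
    """Adaptive ECE: bins hold (roughly) equal numbers of samples."""
    order = np.argsort(confidences)
    err = 0.0
    for chunk in np.array_split(order, n_bins):
        if len(chunk):
            gap = abs(correct[chunk].mean() - confidences[chunk].mean())
            err += len(chunk) / len(confidences) * gap
    return err
```

For a perfectly calibrated model both quantities are zero; systematic overconfidence (high confidence, lower accuracy) inflates both.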
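The hyperparameter search described in the Experiment Setup row — choosing the temperature T (and, analogously, the clipping threshold c) that minimizes NLL on the validation set — can be sketched as a simple grid search (a minimal NumPy illustration; the paper does not specify the optimizer or grid, so these are assumptions):

```python
import numpy as np

def nll(logits, labels):
    """Mean negative log-likelihood of the true class (log-softmax)."""
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def tune_temperature(val_logits, val_labels,
                     grid=np.linspace(0.5, 5.0, 46)):
    """Pick the temperature T minimizing validation NLL.

    Grid range and resolution are illustrative assumptions,
    not the paper's settings.
    """
    return min(grid, key=lambda T: nll(val_logits / T, val_labels))
```

Tuning c for feature clipping works the same way, with `nll(clip_then_classify(features, c), labels)` as the objective instead of `nll(logits / T, labels)`.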