Confidence-based Estimators for Predictive Performance in Model Monitoring
Authors: Juhani Kivimäki, Jukka K. Nurminen, Jakub Białek, Wojtek Kuberski
JAIR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We complement our theoretical results with empirical experiments, comparing AC against more complex estimators in a monitoring setting under covariate shift. We conduct our experiments using synthetic datasets, which allow for full control over the nature of the shift. Our experiments with binary classifiers show that the AC method is able to beat other estimators in many cases. |
| Researcher Affiliation | Collaboration | Juhani Kivimäki EMAIL Jukka K. Nurminen EMAIL University of Helsinki Jakub Białek EMAIL Wojtek Kuberski EMAIL NannyML |
| Pseudocode | No | The paper describes methods using mathematical formulations and textual explanations, but does not include any clearly labeled pseudocode or algorithm blocks. For example, Section 4.1 'Estimating Predictive Accuracy' and Section 4.3 'Estimating the Confusion Matrix for Failure Prediction' detail procedures without using pseudocode. |
| Open Source Code | Yes | Code is publicly available at https://github.com/JuhaniK/AC_trials |
| Open Datasets | No | We conduct two experiments with synthetic data. ... As stated in Section 2, calibration error is caused by the confidence scores not aligning with empirical probabilities. We can try to minimize this discrepancy by first creating a set of confidence scores. ... We chose to create our simulated set of confidence scores by drawing samples from a mixture of three Beta distributions. |
| Dataset Splits | Yes | In both scenarios, we trained the models with 100,000 samples, of which 80% were easy and 20% were hard. Using the same ratio, we drew 25,000 additional samples to train the calibration mappings for each model and 25,000 additional samples for the setup required by the DOC-Feat and ATC methods. Since there was no shortage of data, we used the non-parametric Isotonic Regression (Zadrozny & Elkan, 2002) to derive the calibration mappings. It is generally considered to produce good quality calibration mappings but is also known to sometimes overfit with smaller datasets (Kull et al., 2017). In each trial, we created a test dataset of 25,000 samples, with an increasing portion of hard-to-predict samples, as described in Table 3. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processors, or memory amounts used for running its experiments. It mentions using implementations from scikit-learn, XGBoost, and Light-GBM, but not the hardware these were run on. |
| Software Dependencies | No | In this work, we leverage the algorithm presented by Hong (2013), which is based on the Fast Fourier Transform and implemented in the poibin python library. ... using implementations provided by the scikit-learn library, as well as XGBoost (XGB), and Light-GBM (LGBM). Although libraries are mentioned, no specific version numbers are provided for Python, scikit-learn, XGBoost, LGBM, or the poibin library. |
| Experiment Setup | No | All models were trained with the default parameter settings of their respective implementations. ... In each trial, we drew 500 samples with replacement from the sampled test dataset as a simulated batch of incoming data. |
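The calibration and batch-sampling procedure quoted in the rows above (fit an Isotonic Regression calibration mapping on a held-out split, then repeatedly draw 500-sample batches with replacement and average the calibrated confidence scores) can be sketched as follows. This is a hedged illustration, not the authors' code: the Beta-distributed scores, sample sizes, and the simple miscalibration used here are assumptions for demonstration; see the linked repository for the actual implementation.

```python
# Illustrative sketch of the monitoring setup described in the report.
# All data here is simulated; names and distributions are assumptions.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Stand-in for the paper's 25,000-sample calibration split:
# raw (uncalibrated) confidence scores and binary correctness labels.
n_cal = 25_000
raw_conf = rng.beta(5, 2, size=n_cal)            # uncalibrated scores
y_cal = rng.random(n_cal) < raw_conf ** 1.3      # deliberately miscalibrated

# Non-parametric calibration mapping (Zadrozny & Elkan, 2002).
iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(raw_conf, y_cal.astype(float))

# Simulated test-set scores, calibrated through the fitted mapping.
test_conf = iso.predict(rng.beta(4, 3, size=25_000))

# One trial: draw a batch of 500 samples with replacement, then form
# the Average Confidence (AC) accuracy estimate as the batch mean.
batch = rng.choice(test_conf, size=500, replace=True)
ac_estimate = batch.mean()
```

Repeating the last two lines over many trials, with test sets containing an increasing share of hard-to-predict samples, simulates the covariate-shift monitoring scenario the paper evaluates.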