Confidence-based Estimators for Predictive Performance in Model Monitoring
Authors: Juhani Kivimäki, Jukka K. Nurminen, Jakub Białek, Wojtek Kuberski
JAIR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We complement our theoretical results with empirical experiments, comparing AC against more complex estimators in a monitoring setting under covariate shift. We conduct our experiments using synthetic datasets, which allow for full control over the nature of the shift. Our experiments with binary classifiers show that the AC method is able to beat other estimators in many cases. |
| Researcher Affiliation | Collaboration | Juhani Kivimäki EMAIL Jukka K. Nurminen EMAIL University of Helsinki Jakub Białek EMAIL Wojtek Kuberski EMAIL NannyML |
| Pseudocode | No | The paper describes methods using mathematical formulations and textual explanations, but does not include any clearly labeled pseudocode or algorithm blocks. For example, Section 4.1 'Estimating Predictive Accuracy' and Section 4.3 'Estimating the Confusion Matrix for Failure Prediction' detail procedures without using pseudocode. |
| Open Source Code | Yes | Code is publicly available at https://github.com/JuhaniK/AC_trials |
| Open Datasets | No | We conduct two experiments with synthetic data. ... As stated in Section 2, calibration error is caused by the confidence scores not aligning with empirical probabilities. We can try to minimize this discrepancy by first creating a set of confidence scores. ... We chose to create our simulated set of confidence scores by drawing samples from a mixture of three Beta distributions. |
| Dataset Splits | Yes | In both scenarios, we trained the models with 100,000 samples, of which 80% were easy and 20% were hard. Using the same ratio, we drew 25,000 additional samples to train the calibration mappings for each model and 25,000 additional samples for the setup required by the DOC-Feat and ATC methods. Since there was no shortage of data, we used the non-parametric Isotonic Regression (Zadrozny & Elkan, 2002) to derive the calibration mappings. It is generally considered to produce good quality calibration mappings but is also known to sometimes overfit with smaller datasets (Kull et al., 2017). In each trial, we created a test dataset of 25,000 samples, with an increasing portion of hard-to-predict samples, as described in Table 3. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processors, or memory amounts used for running its experiments. It mentions using implementations from scikit-learn, XGBoost, and Light-GBM, but not the hardware these were run on. |
| Software Dependencies | No | In this work, we leverage the algorithm presented by Hong (2013), which is based on the Fast Fourier Transform and implemented in the poibin python library. ... using implementations provided by the scikit-learn library, as well as XGBoost (XGB), and Light-GBM (LGBM). Although libraries are mentioned, no specific version numbers are provided for Python, scikit-learn, XGBoost, LGBM, or the poibin library. |
| Experiment Setup | No | All models were trained with the default parameter settings of their respective implementations. ... In each trial, we drew 500 samples with replacement from the sampled test dataset as a simulated batch of incoming data. |
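The calibration and batch-sampling procedure quoted in the rows above (fit an Isotonic Regression calibration mapping on a held-out split, then repeatedly draw 500-sample batches with replacement and average the calibrated confidence scores) can be sketched as follows. This is a hedged illustration, not the authors' code: the Beta-distributed scores, sample sizes, and the simple miscalibration used here are assumptions for demonstration; see the linked repository for the actual implementation.

```python
# Illustrative sketch of the monitoring setup described in the report.
# All data here is simulated; names and distributions are assumptions.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Stand-in for the paper's 25,000-sample calibration split:
# raw (uncalibrated) confidence scores and binary correctness labels.
n_cal = 25_000
raw_conf = rng.beta(5, 2, size=n_cal)            # uncalibrated scores
y_cal = rng.random(n_cal) < raw_conf ** 1.3      # deliberately miscalibrated

# Non-parametric calibration mapping (Zadrozny & Elkan, 2002).
iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(raw_conf, y_cal.astype(float))

# Simulated test-set scores, calibrated through the fitted mapping.
test_conf = iso.predict(rng.beta(4, 3, size=25_000))

# One trial: draw a batch of 500 samples with replacement, then form
# the Average Confidence (AC) accuracy estimate as the batch mean.
batch = rng.choice(test_conf, size=500, replace=True)
ac_estimate = batch.mean()
```

Repeating the last two lines over many trials, with test sets containing an increasing share of hard-to-predict samples, simulates the covariate-shift monitoring scenario the paper evaluates.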