When to retrain a machine learning model
Authors: Florence Regol, Leo Schwinn, Kyle Sprague, Mark Coates, Thomas Markovich
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments addressing classification tasks show that the method consistently outperforms existing baselines on 7 datasets. ... 5. Experiments Evaluation Metrics The performance of a retraining decision method is evaluated based on both the average performance and the total retraining cost. ... Table 1. AUC of the combined performance/retraining cost metric Cα(θ), computed over a range of α values, for all datasets. ... Ablation study: Importance of uncertainty ... Sensitivity study: Robustness to wrong α |
| Researcher Affiliation | Collaboration | ¹McGill University, Canada; ²Block, Toronto, Canada; ³Technical University of Munich, Germany. Correspondence to: Florence Regol <EMAIL>. |
| Pseudocode | No | The paper describes the methodology using mathematical equations and descriptive text, but does not include a clearly labeled pseudocode or algorithm block. |
| Open Source Code | No | The paper does not contain an explicit statement about the release of source code for the methodology or a link to a code repository. |
| Open Datasets | Yes | We present results on synthetic and real datasets. For the real datasets, we use datasets with a timestamp for each sample and partition the data in time to create a sequence of datasets D0,D1,.... ... (ii) the airplane dataset (Gomes et al., 2017), ... (iii) YelpChi (Dou et al., 2020), ... and (iv) epicgames (Ozmen et al., 2024), ... iWildCam (Beery et al., 2020) ... For the synthetic dataset, we follow Mahadevan & Mathioudakis (2024) to generate two 2D datasets with covariate shift (Gauss) and concept drift (circles) (Pesaranghader et al., 2016). |
| Dataset Splits | Yes | For the real datasets, we use datasets with a timestamp for each sample and partition the data in time to create a sequence of datasets D0,D1,.... For each trial, we sample a different sequence of length w + T within the complete dataset sequence available. ... We use a similar setup to the one followed in our experiment, setting the offline window size w = 7, evaluating over an online phase of T = 8 steps, and presenting results over 10 trials (See table 11). |
| Hardware Specification | Yes | Our architecture involves using a pretrained vision model, ... Training was conducted using 4 H100 GPUs for 2 days. |
| Software Dependencies | No | For µϕ(ri,j), we use a linear regression model, Elastic Net CV (Zou & Hastie, 2005), from the scikit-learn library. All other optimization parameters are set to default choices from the scikit-learn library. ... We follow Mahadevan & Mathioudakis (2024) and use the scikit-multiflow library (Montiel et al., 2018) version of the airplane dataset. ... pretrained vision models made available from timm. No library version numbers are reported. |
| Experiment Setup | Yes | We set the confidence threshold of our UPF algorithm to δ = 95%, as it is a standard value used for confidence intervals. For µϕ(ri,j), we use a linear regression model, Elastic Net CV (Zou & Hastie, 2005), from the scikit-learn library. All other optimization parameters are set to default choices from the scikit-learn library. ... The fine-tuning process uses the Adam optimizer with a fixed learning rate of 10^-4 and a weight decay parameter of 10^-5. |
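The dataset-splits row quotes the paper's protocol: partition timestamped data in time into a sequence D0, D1, ..., then, per trial, sample a contiguous sub-sequence of length w + T (offline window w = 7, online phase T = 8). A minimal sketch of that windowing, assuming equal-width time bins and `(timestamp, sample)` pairs (the paper's exact binning is not quoted, so both helper names are illustrative):

```python
import random

def partition_by_time(samples, num_bins):
    """Split (timestamp, x) pairs into num_bins equal-width time bins D0, D1, ..."""
    ts = [t for t, _ in samples]
    lo, hi = min(ts), max(ts)
    width = (hi - lo) / num_bins or 1  # guard against all-equal timestamps
    bins = [[] for _ in range(num_bins)]
    for t, x in samples:
        i = min(int((t - lo) / width), num_bins - 1)  # clamp the max timestamp
        bins[i].append(x)
    return bins

def sample_trial(datasets, w, T, rng=random):
    """Pick a random contiguous sub-sequence of length w + T:
    the first w datasets form the offline window, the next T the online phase."""
    start = rng.randrange(len(datasets) - (w + T) + 1)
    seq = datasets[start:start + w + T]
    return seq[:w], seq[w:]
```

With w = 7 and T = 8 as quoted, each trial requires a dataset sequence of at least 15 partitions; the 10 trials per dataset each draw a different start offset.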
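The setup row states that µϕ(ri,j) is an Elastic Net CV regressor from scikit-learn with default parameters. A self-contained sketch of that component on toy data (the features and targets here are stand-ins; the paper's feature construction for ri,j is not quoted):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

# Toy stand-in regression problem; in the paper, the regressor predicts
# model performance from retraining-decision features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)

# ElasticNetCV with scikit-learn defaults, matching the quoted setup
# ("all other optimization parameters are set to default choices").
model = ElasticNetCV().fit(X, y)
preds = model.predict(X)
```

The quoted fine-tuning details (Adam, learning rate 10^-4, weight decay 10^-5) apply to the separate pretrained-vision-model component and are not reproduced here.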