Selective Prediction via Training Dynamics
Authors: Stephan Rabanser, Anvith Thudi, Kimia Hamidieh, Adam Dziedzic, Israfil Bahceci, Akram Bin Sediq, Hamza Sokun, Nicolas Papernot
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental evaluation on image classification, regression, and time series problems shows that our method beats past state-of-the-art accuracy/utility trade-offs on typical selective prediction benchmarks. We perform a comprehensive set of empirical experiments on established selective prediction benchmarks spanning over classification, regression, and time series prediction problems (Section 4). |
| Researcher Affiliation | Collaboration | Stephan Rabanser EMAIL University of Toronto & Vector Institute, Anvith Thudi EMAIL University of Toronto & Vector Institute, Kimia Hamidieh EMAIL Massachusetts Institute of Technology, Adam Dziedzic EMAIL CISPA Helmholtz Center for Information Security, Israfil Bahceci EMAIL Ericsson, Akram Bin Sediq EMAIL Ericsson, Hamza Sokun EMAIL Ericsson, Nicolas Papernot EMAIL University of Toronto & Vector Institute |
| Pseudocode | Yes | Algorithm 1: SPTD for classification; Algorithm 2: SPTD for regression; Algorithm 3: SPTD for time series forecasting |
| Open Source Code | No | The paper does not contain an explicit statement about the release of source code or a link to a code repository. |
| Open Datasets | Yes | We evaluate SPTD on image dataset benchmarks that are common in the selective classification literature: CIFAR-10/CIFAR-100 (Krizhevsky et al., 2009), Stanford Cars (Krause et al., 2013), and Food101 (Bossard et al., 2014)... Our experimental suite for regression considers the following datasets: California housing dataset (Pace & Barry, 1997), the concrete strength dataset (Yeh, 2007), and the fish toxicity dataset (Ballabio et al., 2019)... As part of our time series experiments, we mainly consider the M4 forecasting competition dataset (Makridakis et al., 2020) which contains time series aggregated at various time intervals (e.g., hourly, daily). In addition, we also provide experimentation on the Hospital dataset (Hyndman, 2015). |
| Dataset Splits | Yes | We split all datasets into 80% training and 20% test sets after a random shuffle. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or detailed computer specifications used for running the experiments. It mentions 'GPU clusters' in the acknowledgements but no specific models. |
| Software Dependencies | No | The paper mentions using the 'GluonTS time series framework' and the 'DeepAR model' but does not specify their version numbers or any other software dependencies with version details. |
| Experiment Setup | Yes | For each dataset, we train a deep neural network following the ResNet-18 architecture (He et al., 2016) and checkpoint each model after processing 50 mini-batches of size 128. All models are trained over 200 epochs (400 epochs for Stanford Cars) using the SGD optimizer with an initial learning rate of 10⁻², momentum 0.9, and weight decay 10⁻⁴. Across all datasets, we decay the learning rate by a factor of 0.5 in 25-epoch intervals... We train a fully connected neural network with layer dimensionalities D → 10 → 7 → 4 → 1. Optimization is performed using full-batch gradient descent using the Adam optimizer with learning rate 10⁻² over 200 epochs and weight decay 10⁻². |
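As a minimal sketch (not from the paper, whose code is unreleased), the stepwise learning-rate schedule quoted in the setup row — initial rate 10⁻², decayed by a factor of 0.5 every 25 epochs over 200 epochs — can be expressed as a small helper; the function name and 0-indexed epoch convention are our own assumptions:

```python
def lr_at_epoch(epoch: int, base_lr: float = 1e-2,
                decay: float = 0.5, interval: int = 25) -> float:
    """Step schedule matching the reported setup: multiply the
    learning rate by `decay` once every `interval` epochs.
    Epochs are 0-indexed here (an assumption, not from the paper)."""
    return base_lr * decay ** (epoch // interval)

# Epoch 0 uses the initial rate; by the last of 200 epochs the
# rate has been halved 7 times.
print(lr_at_epoch(0))    # 0.01
print(lr_at_epoch(199))  # 0.01 * 0.5**7 = 7.8125e-05
```

In PyTorch this corresponds to `torch.optim.lr_scheduler.StepLR(optimizer, step_size=25, gamma=0.5)` attached to the SGD optimizer described above.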