Selective Prediction via Training Dynamics
Authors: Stephan Rabanser, Anvith Thudi, Kimia Hamidieh, Adam Dziedzic, Israfil Bahceci, Akram Bin Sediq, Hamza Sokun, Nicolas Papernot
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental evaluation on image classification, regression, and time series problems shows that our method beats past state-of-the-art accuracy/utility trade-offs on typical selective prediction benchmarks. We perform a comprehensive set of empirical experiments on established selective prediction benchmarks spanning over classification, regression, and time series prediction problems (Section 4). |
| Researcher Affiliation | Collaboration | Stephan Rabanser EMAIL University of Toronto & Vector Institute, Anvith Thudi EMAIL University of Toronto & Vector Institute, Kimia Hamidieh EMAIL Massachusetts Institute of Technology, Adam Dziedzic EMAIL CISPA Helmholtz Center for Information Security, Israfil Bahceci EMAIL Ericsson, Akram Bin Sediq EMAIL Ericsson, Hamza Sokun EMAIL Ericsson, Nicolas Papernot EMAIL University of Toronto & Vector Institute |
| Pseudocode | Yes | Algorithm 1: SPTD for classification; Algorithm 2: SPTD for regression; Algorithm 3: SPTD for time series forecasting |
| Open Source Code | No | The paper does not contain an explicit statement about the release of source code or a link to a code repository. |
| Open Datasets | Yes | We evaluate SPTD on image dataset benchmarks that are common in the selective classification literature: CIFAR-10/CIFAR-100 (Krizhevsky et al., 2009), Stanford Cars (Krause et al., 2013), and Food101 (Bossard et al., 2014)... Our experimental suite for regression considers the following datasets: California housing dataset (Pace & Barry, 1997), the concrete strength dataset (Yeh, 2007), and the fish toxicity dataset (Ballabio et al., 2019)... As part of our time series experiments, we mainly consider the M4 forecasting competition dataset (Makridakis et al., 2020) which contains time series aggregated at various time intervals (e.g., hourly, daily). In addition, we also provide experimentation on the Hospital dataset (Hyndman, 2015). |
| Dataset Splits | Yes | We split all datasets into 80% training and 20% test sets after a random shuffle. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or detailed computer specifications used for running the experiments. It mentions 'GPU clusters' in the acknowledgements but no specific models. |
| Software Dependencies | No | The paper mentions using the 'GluonTS time series framework' and the 'DeepAR model' but does not specify their version numbers or any other software dependencies with version details. |
| Experiment Setup | Yes | For each dataset, we train a deep neural network following the ResNet-18 architecture (He et al., 2016) and checkpoint each model after processing 50 mini-batches of size 128. All models are trained over 200 epochs (400 epochs for Stanford Cars) using the SGD optimizer with an initial learning rate of 10⁻², momentum 0.9, and weight decay 10⁻⁴. Across all datasets, we decay the learning rate by a factor of 0.5 in 25-epoch intervals... We train a fully connected neural network with layer dimensionalities D → 10 → 7 → 4 → 1. Optimization is performed using full-batch gradient descent using the Adam optimizer with learning rate 10⁻² over 200 epochs and weight decay 10⁻². |
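As a minimal sketch (not from the paper, whose code is unreleased), the stepwise learning-rate schedule quoted in the setup row — initial rate 10⁻², decayed by a factor of 0.5 every 25 epochs over 200 epochs — can be expressed as a small helper; the function name and 0-indexed epoch convention are our own assumptions:

```python
def lr_at_epoch(epoch: int, base_lr: float = 1e-2,
                decay: float = 0.5, interval: int = 25) -> float:
    """Step schedule matching the reported setup: multiply the
    learning rate by `decay` once every `interval` epochs.
    Epochs are 0-indexed here (an assumption, not from the paper)."""
    return base_lr * decay ** (epoch // interval)

# Epoch 0 uses the initial rate; by the last of 200 epochs the
# rate has been halved 7 times.
print(lr_at_epoch(0))    # 0.01
print(lr_at_epoch(199))  # 0.01 * 0.5**7 = 7.8125e-05
```

In PyTorch this corresponds to `torch.optim.lr_scheduler.StepLR(optimizer, step_size=25, gamma=0.5)` attached to the SGD optimizer described above.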