Calibrated Selective Classification

Authors: Adam Fisch, Tommi S. Jaakkola, Regina Barzilay

TMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the empirical effectiveness of our approach on multiple image classification and lung cancer risk assessment tasks... We demonstrate consistent empirical reductions in selective calibration error metrics... across multiple tasks and datasets.
Researcher Affiliation | Academia | Adam Fisch, Tommi Jaakkola, and Regina Barzilay; Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology, Cambridge, MA 02142, USA.
Pseudocode | Yes | Algorithm 1: Robust training for calibrated selective classification.
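For context on what Algorithm 1 ultimately produces, the following is a minimal sketch of the generic predict-or-abstain rule of a selective classifier: a base model f supplies class probabilities, and a selector g supplies confidence scores used to decide whether to keep or reject each prediction. The function name, threshold, and toy values are illustrative assumptions, not the paper's code.

```python
import numpy as np

def selective_predict(probs, confidence, threshold):
    """Predict-or-abstain rule of a selective classifier (sketch).

    probs: (n, k) class probabilities from the base model f.
    confidence: (n,) scores from a selector g (higher = keep).
    threshold: abstain when confidence < threshold.
    Returns the predicted class per example, or -1 for abstentions.
    """
    preds = probs.argmax(axis=1)
    preds[confidence < threshold] = -1  # abstain on low-confidence inputs
    return preds

# Toy usage: two kept predictions, one abstention.
p = np.array([[0.9, 0.1], [0.2, 0.8], [0.55, 0.45]])
g = np.array([0.95, 0.9, 0.3])
print(selective_predict(p, g, threshold=0.5))  # -> [ 0  1 -1]
```

Calibrated selective classification then asks that the kept predictions be well calibrated, which is what training g in Algorithm 1 targets.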
Open Source Code | Yes | Our code is available at https://github.com/ajfisch/calibrated-selective-classification.
Open Datasets | Yes | The CIFAR-10 dataset (Krizhevsky, 2012)... The ImageNet dataset (Deng et al., 2009)... data and models from Mikhael et al. (2022), all of which was subject to IRB approval (including usage for this study). The base model for f is a 3D CNN trained on scans from the National Lung Screening Trial (NLST) data (Aberle et al., 2011).
Dataset Splits | Yes | The CIFAR-10 dataset... has 50k images for training and 10k for testing. We remove 5k images each from the training set for validation and perturbation datasets for training g... We then use a separate split of 6,282 scans from the NLST data to train g... We evaluate our selective classifier on a new set of 1,337 scans obtained from Massachusetts General Hospital (MGH).
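The CIFAR-10 split described above (carving 5k validation and 5k perturbation examples out of the 50k training images) can be sketched as a simple index partition. The sizes come from the text; the seed and shuffling scheme are illustrative assumptions.

```python
import numpy as np

def split_cifar10_indices(n_train=50_000, n_val=5_000, n_pert=5_000, seed=0):
    """Partition the CIFAR-10 training indices into train / validation /
    perturbation subsets (sizes from the paper; seed is illustrative)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_train)
    val = idx[:n_val]                      # 5k held out for validation
    pert = idx[n_val:n_val + n_pert]       # 5k held out for perturbations
    train = idx[n_val + n_pert:]           # remaining 40k for training
    return train, val, pert

train, val, pert = split_cifar10_indices()
print(len(train), len(val), len(pert))  # -> 40000 5000 5000
```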
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments.
Software Dependencies | No | All models are trained in PyTorch (Paszke et al., 2019), while high-level input features (§4.5) for g are computed with sklearn (Pedregosa et al., 2011). The paper mentions software but does not specify version numbers for PyTorch or scikit-learn.
Experiment Setup | Yes | We compute the S-MMCE with samples of size 1024 (of examples with the same perturbation t ∈ T applied), with a combined batch size of m = 32. We train all models for 5 epochs, with 50k samples (where one sample is a batch D of 1024 perturbed examples) per epoch (≈7.8k updates total at the combined batch size of 32). The loss hyper-parameters λ1 and λ2 were set to |D|^(-1/2) = 1/32 and 1e-2 · |D|^(-1) ≈ 1e-5, respectively. The κ in our top-κ calibration error loss is set to 4.
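The quoted hyper-parameters can be checked arithmetically: with |D| = 1024, the stated weights λ1 = |D|^(-1/2) and λ2 = 1e-2 · |D|^(-1) come out to 1/32 and about 1e-5, and 5 epochs of 50k samples at a combined batch size of 32 give roughly 7.8k updates. A minimal sketch of this bookkeeping (the variable names are ours; the loss itself is not reproduced here):

```python
# Derive the stated loss weights and update count from the setup quoted
# in the row above. Only arithmetic; the S-MMCE loss is not implemented.
D = 1024                     # examples per perturbed batch ("sample")
m = 32                       # samples combined per gradient update
lam1 = D ** -0.5             # = 1/32, weight on the first loss term
lam2 = 1e-2 * D ** -1        # ~1e-5, weight on the second loss term
kappa = 4                    # top-kappa calibration error loss

epochs, samples_per_epoch = 5, 50_000
updates = epochs * samples_per_epoch // m  # ~7.8k total updates
print(lam1, lam2, updates)
```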