Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Underspecification Presents Challenges for Credibility in Modern Machine Learning

Authors: Alexander D'Amour, Katherine Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen, Jonathan Deaton, Jacob Eisenstein, Matthew D. Hoffman, Farhad Hormozdiari, Neil Houlsby, Shaobo Hou, Ghassen Jerfel, Alan Karthikesalingam, Mario Lucic, Yian Ma, Cory McLean, Diana Mincu, Akinori Mitani, Andrea Montanari, Zachary Nado, Vivek Natarajan, Christopher Nielson, Thomas F. Osborne, Rajiv Raman, Kim Ramasamy, Rory Sayres, Jessica Schrouff, Martin Seneviratne, Shannon Sequeira, Harini Suresh, Victor Veitch, Max Vladymyrov, Xuezhi Wang, Kellie Webster, Steve Yadlowsky, Taedong Yun, Xiaohua Zhai, D. Sculley

JMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide evidence that underspecification has substantive implications for practical ML pipelines, using examples from computer vision, medical imaging, natural language processing, clinical risk prediction based on electronic health records, and medical genomics. Our results show the need to explicitly account for underspecification in modeling pipelines that are intended for real-world deployment in any domain. ... The second claim is that underspecification is ubiquitous in modern applications of ML, and has substantial practical implications. We support this claim with an empirical study, in which we apply a simple experimental protocol across plausibly deployable deep learning pipelines in computer vision, medical imaging, natural language processing (NLP), and electronic health record (EHR) based prediction.
Researcher Affiliation | Collaboration | Alexander D'Amour EMAIL Katherine Heller EMAIL ... Yian Ma EMAIL Cory McLean EMAIL ... Andrea Montanari EMAIL Zachary Nado EMAIL ... Christopher Nielson EMAIL Thomas F. Osborne EMAIL Rajiv Raman EMAIL Kim Ramasamy EMAIL Rory Sayres EMAIL ... Harini Suresh EMAIL Victor Veitch EMAIL ... D. Sculley EMAIL
Pseudocode | No | The paper describes various models and methods but does not include any clearly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | No | The paper mentions a canonical codebase for word2vec as a third-party tool it used: 'Canonical codebase is https://code.google.com/archive/p/word2vec/; a GitHub export of this repository is available at https://github.com/tmikolov/word2vec.' However, this is not the authors' own implementation code for the methodology described in this paper.
Open Datasets | Yes | ImageNet validation set (Deng et al., 2009). ... JFT-300M dataset (Sun et al., 2017). ... Using data from the UK Biobank (Sudlow et al., 2015). ... de-identified retrospective fundus images from EyePACS in the United States and from eye hospitals in India. ... OntoNotes dataset (Hovy et al., 2006). ... StereoSet benchmark (Nadeem et al., 2020). ... HANS stress test (McCoy et al., 2019b) and the Stress Test suite from Naik et al. (2018).
Dataset Splits | Yes | On the ImageNet validation set, the ResNet-50 predictors achieve a 75.9% ± 0.11 top-1 accuracy... The ImageNet test set is the iid evaluation... For our experiment, we restrict the pipeline to incorporate only standard iid validation. ... we evaluate the predictors on a stress test that stratifies the test set by skin type... For tasks that require fine-tuning, we fine-tune each of the five checkpoints 20 times using different random seeds. ... we partitioned the UK Biobank population into British and non-British individuals, and then we randomly partitioned the British individuals into British training and evaluation sets. We leave the non-British individuals out of training and use them solely for evaluation. ... We then randomly partitioned 91,971 British individuals defined as above into a British training set (82,309 individuals) and a British evaluation set (9,662 individuals). The remaining non-British individuals (14,898) were used solely for evaluation.
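The quoted UK Biobank split (British train/eval plus a held-out non-British evaluation set) can be sketched as follows. This is a minimal illustration, not the authors' code; the `british` flag and function name are hypothetical, and the eval fraction is derived from the quoted counts (9,662 of 91,971).

```python
import random

def partition_ukb(individuals, eval_frac=9662 / 91971, seed=0):
    """Mirror the quoted protocol: British individuals are randomly split
    into train/eval sets; non-British individuals are held out entirely
    and used only for evaluation. `individuals` is a list of dicts with
    a hypothetical boolean 'british' field."""
    british = [p for p in individuals if p["british"]]
    non_british = [p for p in individuals if not p["british"]]
    rng = random.Random(seed)
    rng.shuffle(british)
    n_eval = round(len(british) * eval_frac)
    train, british_eval = british[n_eval:], british[:n_eval]
    return train, british_eval, non_british
```

The key design point in the quoted protocol is that the non-British set never touches training, so it serves as a distribution-shift stress test rather than an iid hold-out.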
Hardware Specification | No | The paper describes various deep learning models (e.g., ResNet-50, BiT, Inception-V4, BERT, RNN) and their configurations, but does not provide specific hardware details such as GPU models, CPU types, or memory amounts used for training or inference.
Software Dependencies | Yes | We identified and clustered the IOP-associated variants with PLINK v1.9 (Purcell et al., 2007), a standard tool in population genetics, using the --clump command.
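For reference, a PLINK v1.9 clumping step of the kind quoted above might look like the following invocation. The file names are hypothetical and the thresholds are commonly used values, not the paper's settings; `--clump-p1`, `--clump-r2`, and `--clump-kb` are standard PLINK 1.9 clumping options.

```shell
# Clump GWAS association results into independent loci with PLINK v1.9.
# Input/output names are placeholders; thresholds are typical, not the paper's.
plink --bfile ukb_genotypes \
      --clump iop_gwas_results.assoc \
      --clump-p1 5e-8 --clump-r2 0.1 --clump-kb 250 \
      --out iop_clumped
```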
Experiment Setup | No | We train 50 ResNet-50 models on ImageNet using identical pipelines that differ only in their random seed, and 30 BiT models that are initialized at the same JFT-300M-trained checkpoint and differ only in their fine-tuning seed and initialization distributions (10 runs each of zero, uniform, and Gaussian initializations). ... Specifically, we train 5 instances of the BERT large-cased language model (Devlin et al., 2019), using the same Wikipedia and BookCorpus data that was used to train the public checkpoints. For tasks that require fine-tuning, we fine-tune each of the five checkpoints 20 times using different random seeds. ... To examine underspecification, we construct a predictor ensemble by training the model from 5 random seeds for each of three RNN cell types: Simple Recurrent Units (SRU, Lei et al. (2018)), Long Short-Term Memory (LSTM, Hochreiter and Schmidhuber (1997)), or Update Gate RNN (UGRNN, Collins et al. (2017)). This yields an ensemble of 15 predictors in total. The paper describes the variations in random seeds and initialization distributions but does not provide concrete hyperparameter values (e.g., learning rate, batch size, epochs) for the training of these models.