Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Underspecification Presents Challenges for Credibility in Modern Machine Learning

Authors: Alexander D'Amour, Katherine Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen, Jonathan Deaton, Jacob Eisenstein, Matthew D. Hoffman, Farhad Hormozdiari, Neil Houlsby, Shaobo Hou, Ghassen Jerfel, Alan Karthikesalingam, Mario Lucic, Yian Ma, Cory McLean, Diana Mincu, Akinori Mitani, Andrea Montanari, Zachary Nado, Vivek Natarajan, Christopher Nielson, Thomas F. Osborne, Rajiv Raman, Kim Ramasamy, Rory Sayres, Jessica Schrouff, Martin Seneviratne, Shannon Sequeira, Harini Suresh, Victor Veitch, Max Vladymyrov, Xuezhi Wang, Kellie Webster, Steve Yadlowsky, Taedong Yun, Xiaohua Zhai, D. Sculley

JMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide evidence that underspecification has substantive implications for practical ML pipelines, using examples from computer vision, medical imaging, natural language processing, clinical risk prediction based on electronic health records, and medical genomics. Our results show the need to explicitly account for underspecification in modeling pipelines that are intended for real-world deployment in any domain. ... The second claim is that underspecification is ubiquitous in modern applications of ML, and has substantial practical implications. We support this claim with an empirical study, in which we apply a simple experimental protocol across plausibly deployable deep learning pipelines in computer vision, medical imaging, natural language processing (NLP), and electronic health record (EHR) based prediction.
Researcher Affiliation | Collaboration | Alexander D'Amour EMAIL Katherine Heller EMAIL ... Yian Ma EMAIL Cory McLean EMAIL ... Andrea Montanari EMAIL Zachary Nado EMAIL ... Christopher Nielson EMAIL Thomas F. Osborne EMAIL Rajiv Raman EMAIL Kim Ramasamy EMAIL Rory Sayres EMAIL ... Harini Suresh EMAIL Victor Veitch EMAIL ... D. Sculley EMAIL
Pseudocode | No | The paper describes various models and methods but does not include any clearly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | No | The paper mentions a canonical codebase for word2vec as a third-party tool it used: 'Canonical codebase is https://code.google.com/archive/p/word2vec/; a GitHub export of this repository is available at https://github.com/tmikolov/word2vec.' However, this is not the authors' own implementation code for the methodology described in this paper.
Open Datasets | Yes | ImageNet validation set (Deng et al., 2009). ... JFT-300M dataset (Sun et al., 2017). ... Using data from the UK Biobank (Sudlow et al., 2015). ... de-identified retrospective fundus images from EyePACS in the United States and from eye hospitals in India. ... OntoNotes dataset (Hovy et al., 2006). ... StereoSet benchmark (Nadeem et al., 2020). ... HANS stress test (McCoy et al., 2019b) and the Stress Test suite from Naik et al. (2018).
Dataset Splits | Yes | On the ImageNet validation set, the ResNet-50 predictors achieve a 75.9% ± 0.11 top-1 accuracy... The ImageNet test set is the iid evaluation... For our experiment, we restrict the pipeline to incorporate only standard iid validation. ... we evaluate the predictors on a stress test that stratifies the test set by skin type... For tasks that require fine-tuning, we fine-tune each of the five checkpoints 20 times using different random seeds. ... we partitioned the UK Biobank population into British and non-British individuals, and then we randomly partitioned the British individuals into British training and evaluation sets. We leave the non-British individuals out of training and use them solely for evaluation. ... We then randomly partitioned 91,971 British individuals defined as above into a British training set (82,309 individuals) and a British evaluation set (9,662 individuals). The remaining non-British individuals (14,898) were used solely for evaluation.
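The quoted UK Biobank split (British train/eval plus a held-out non-British evaluation set) can be sketched as follows. This is a minimal illustration, not the authors' code; the `british` flag and function name are hypothetical, and the eval fraction is derived from the quoted counts (9,662 of 91,971).

```python
import random

def partition_ukb(individuals, eval_frac=9662 / 91971, seed=0):
    """Mirror the quoted protocol: British individuals are randomly split
    into train/eval sets; non-British individuals are held out entirely
    and used only for evaluation. `individuals` is a list of dicts with
    a hypothetical boolean 'british' field."""
    british = [p for p in individuals if p["british"]]
    non_british = [p for p in individuals if not p["british"]]
    rng = random.Random(seed)
    rng.shuffle(british)
    n_eval = round(len(british) * eval_frac)
    train, british_eval = british[n_eval:], british[:n_eval]
    return train, british_eval, non_british
```

The key design point in the quoted protocol is that the non-British set never touches training, so it serves as a distribution-shift stress test rather than an iid hold-out.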
Hardware Specification | No | The paper describes various deep learning models (e.g., ResNet-50, BiT, Inception-V4, BERT, RNN) and their configurations, but does not provide specific hardware details such as GPU models, CPU types, or memory amounts used for training or inference.
Software Dependencies | Yes | We identified and clustered the IOP-associated variants with PLINK v1.9 (Purcell et al., 2007), a standard tool in population genetics, using the --clump command.
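For reference, a PLINK v1.9 clumping step of the kind quoted above might look like the following invocation. The file names are hypothetical and the thresholds are commonly used values, not the paper's settings; `--clump-p1`, `--clump-r2`, and `--clump-kb` are standard PLINK 1.9 clumping options.

```shell
# Clump GWAS association results into independent loci with PLINK v1.9.
# Input/output names are placeholders; thresholds are typical, not the paper's.
plink --bfile ukb_genotypes \
      --clump iop_gwas_results.assoc \
      --clump-p1 5e-8 --clump-r2 0.1 --clump-kb 250 \
      --out iop_clumped
```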
Experiment Setup | No | We train 50 ResNet-50 models on ImageNet using identical pipelines that differ only in their random seed, and 30 BiT models that are initialized at the same JFT-300M-trained checkpoint and differ only in their fine-tuning seed and initialization distributions (10 runs each of zero, uniform, and Gaussian initializations). ... Specifically, we train 5 instances of the BERT large-cased language model (Devlin et al., 2019), using the same Wikipedia and BookCorpus data that was used to train the public checkpoints. For tasks that require fine-tuning, we fine-tune each of the five checkpoints 20 times using different random seeds. ... To examine underspecification, we construct a predictor ensemble by training the model from 5 random seeds for each of three RNN cell types: Simple Recurrent Units (SRU, Lei et al. (2018)), Long Short-Term Memory (LSTM, Hochreiter and Schmidhuber (1997)), or Update Gate RNN (UGRNN, Collins et al. (2017)). This yields an ensemble of 15 predictors in total. The paper describes the variations in random seeds and initialization distributions but does not provide concrete hyperparameter values (e.g., learning rate, batch size, epochs) for the training of these models.