Holistic Evaluation of Language Models

Authors: Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri S. Chatterji, Omar Khattab, Peter Henderson, Qian Huang, Ryan Andrew Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, Yuta Koreeda

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models. [...] Third, we conduct a large-scale evaluation of 30 prominent language models (spanning open, limited-access, and closed models) on all 42 scenarios, including 21 scenarios that were not previously used in mainstream LM evaluation. [...] Our evaluation surfaces 25 top-level findings concerning the interplay between different scenarios, metrics, and models.
Researcher Affiliation | Collaboration | Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri Chatterji, Omar Khattab, Peter Henderson, Qian Huang, Ryan Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, Yuta Koreeda (EMAIL, EMAIL, EMAIL) Center for Research on Foundation Models (CRFM), Institute for Human-Centered Artificial Intelligence (HAI), Stanford University
Pseudocode | No | The paper describes methods and processes through narrative text and diagrams (e.g., Figures 1, 2, 3, 5, 6, 7), but it does not include any explicitly labeled pseudocode blocks or algorithms with structured steps.
Open Source Code | Yes | For full transparency, we release all raw model prompts and completions publicly for further analysis, as well as a general modular toolkit for easily adding new scenarios, models, metrics, and prompting strategies: https://github.com/stanford-crfm/helm
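The modular design the paper describes (scenarios, models, metrics, and prompting strategies as pluggable components) can be illustrated with a minimal sketch. All class and function names below are invented for illustration; this is not HELM's actual API, whose interfaces live in the repository linked above:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Instance:
    """One evaluation example: a prompt and its gold reference."""
    prompt: str
    reference: str

@dataclass
class Scenario:
    """A named collection of instances (e.g. a QA dataset)."""
    name: str
    instances: List[Instance]

def exact_match(prediction: str, reference: str) -> float:
    # A simple metric: 1.0 on an exact string match, else 0.0.
    return float(prediction.strip() == reference.strip())

def evaluate(model: Callable[[str], str], scenario: Scenario,
             metric: Callable[[str, str], float]) -> float:
    # Run the model on every instance and average the metric.
    scores = [metric(model(inst.prompt), inst.reference)
              for inst in scenario.instances]
    return sum(scores) / len(scores)

# A stub "model" that always answers "Paris", for demonstration.
stub_model = lambda prompt: "Paris"
scenario = Scenario("capitals", [
    Instance("Capital of France?", "Paris"),
    Instance("Capital of Japan?", "Tokyo"),
])
print(evaluate(stub_model, scenario, exact_match))  # → 0.5
```

Because models, scenarios, and metrics are just interchangeable values here, adding a new one means writing a new function or dataclass instance, which mirrors the extensibility the toolkit advertises.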
Open Datasets | Yes | To select question-answering datasets, we prioritized (i) domain coverage, in terms of the domain of the inputs/contexts and (ii) coverage of component skills required for the datasets (e.g. we deliberately ensured coverage of datasets that required commonsense knowledge and reasoning). We selected the Natural Questions (Kwiatkowski et al., 2019), NarrativeQA (Kočiský et al., 2017), and QuAC (Choi et al., 2018) datasets... We access the dataset at https://github.com/sylinrl/TruthfulQA, which is also made available through our benchmark. ...WikiFact is a new dataset constructed in this work. ... The dataset is made available through our benchmark.
Dataset Splits | Yes | The train-dev-test splits for the dataset are 9427-3270-3245 samples. (BoolQ, Section B.1.1) ... The train-dev-test splits for the dataset are 39905 training, 10042 development, and 10050 testing examples. (HellaSwag, Section B.1.6) ... We split the 1000 triples into 100 training, 50 dev, and 850 test examples. (WikiFact, Section E.2.1)
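The quoted split sizes can be sanity-checked with a few lines of Python; the counts come directly from the row above, and the totals confirm internal consistency (e.g. the WikiFact splits sum to the stated 1000 triples):

```python
# Split sizes as quoted in the paper's appendix sections.
splits = {
    "BoolQ":     {"train": 9427,  "dev": 3270,  "test": 3245},
    "HellaSwag": {"train": 39905, "dev": 10042, "test": 10050},
    "WikiFact":  {"train": 100,   "dev": 50,    "test": 850},
}

for name, s in splits.items():
    total = sum(s.values())
    # Report each split as a fraction of the dataset.
    print(f"{name}: total={total}, "
          f"train={s['train']/total:.1%}, "
          f"dev={s['dev']/total:.1%}, "
          f"test={s['test']/total:.1%}")
```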
Hardware Specification | Yes | To perform inference on the public models, we used the Together Research Computer. At the time of this work, Together Research Computer connects clusters at several institutions. We mainly use NVIDIA GeForce RTX 2080 Ti GPUs and NVIDIA A100 GPUs to perform inference. (Table 6: Hardware and compute for public models) ... GPT-J (6B) was trained using 256 TPU v3 cores...
Software Dependencies | No | The paper mentions software such as the 'pyserini' library and 'Megatron (Shoeybi et al., 2019)' but does not specify version numbers for these or other key software components used in its methodology, which is necessary for a reproducible setup.
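A reproducible setup would pin exact versions of every dependency. The fragment below is a hypothetical requirements.txt illustrating the level of detail this assessment finds missing; the version numbers are invented for illustration and are not taken from the paper or the HELM repository:

```
# Hypothetical requirements.txt -- all versions illustrative only
pyserini==0.17.1        # retrieval library named in the paper (version hypothetical)
torch==1.12.1           # hypothetical
transformers==4.21.0    # hypothetical
```

Pinning versions this way (or shipping a lock file / container image) is what lets a later reader reconstruct the exact software environment.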
Experiment Setup | Yes | In short, we use prompting as our adaptation method with 5 in-context examples (when in-context examples are included) as depicted in Figure 23. ... For all scenarios where the ideal model behavior is to generate a very short sequence that should match the correct reference, we set the temperature to be zero as we desire the argmax under the model’s distribution. For the longer-form generation scenarios such as text summarization, we either follow prior work or specify a process by which we arrived at the temperature we used. ... We generally set the stop sequence to be the newline character \n. ... Additionally, we set max tokens for each scenario based on the longest reference’s length for that scenario.
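The decoding configuration quoted above (5 in-context examples, temperature 0 for short-answer scenarios, newline stop sequence, max tokens tied to the longest reference) can be sketched as a request builder. The function and field names are illustrative, not taken from the HELM codebase, and token length is crudely approximated by whitespace splitting rather than a real tokenizer:

```python
def build_request(train_examples, test_input, references):
    """Assemble a 5-shot completion request per the quoted setup (illustrative)."""
    # 5 in-context examples, the paper's default adaptation setting.
    shots = "\n".join(f"Input: {x}\nOutput: {y}"
                      for x, y in train_examples[:5])
    prompt = f"{shots}\nInput: {test_input}\nOutput:"
    return {
        "prompt": prompt,
        "temperature": 0.0,   # argmax decoding for short-answer scenarios
        "stop": ["\n"],       # stop at the newline character, per the quoted setup
        # Max tokens derived from the longest reference for the scenario
        # (whitespace split used here as a stand-in for real tokenization).
        "max_tokens": max(len(r.split()) for r in references),
    }

req = build_request(
    [("2+2", "4"), ("3+3", "6"), ("1+5", "6"), ("2+7", "9"), ("4+4", "8")],
    "5+5",
    ["10"],
)
print(req["temperature"], req["max_tokens"])  # 0.0 1
```

Tying max tokens to the longest reference bounds generation cost without truncating any correct answer, which is the rationale the quoted setup implies.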