Emergent Abilities of Large Language Models

Authors: Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, William Fedus

TMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This paper instead discusses an unpredictable phenomenon that we refer to as emergent abilities of large language models. We consider an ability to be emergent if it is not present in smaller models but is present in larger models. Thus, emergent abilities cannot be predicted simply by extrapolating the performance of smaller models. The existence of such emergence raises the question of whether additional scaling could potentially further expand the range of capabilities of language models. We survey emergent abilities as observed in a range of prior work, categorizing them in settings such as few-shot prompting (§3) and augmented prompting strategies (§4).
Researcher Affiliation | Collaboration | Jason Wei (1, EMAIL), Yi Tay (1, EMAIL), Rishi Bommasani (2, EMAIL), Colin Raffel (3, craffel@gmail.com), Barret Zoph (1, EMAIL), Sebastian Borgeaud (4, EMAIL), Dani Yogatama (4, EMAIL), Maarten Bosma (1, EMAIL), Denny Zhou (1, EMAIL), Donald Metzler (1, EMAIL), Ed H. Chi (1, EMAIL), Tatsunori Hashimoto (2, EMAIL), Oriol Vinyals (4, EMAIL), Percy Liang (2, EMAIL), Jeff Dean (1, jeff@google.com), William Fedus (1, EMAIL). Affiliations: 1 Google Research, 2 Stanford University, 3 UNC Chapel Hill, 4 DeepMind.
Pseudocode | No | The paper is a survey and analysis of emergent abilities in large language models. It describes methodologies and observations from other research, but does not present any novel algorithms or pseudocode blocks within its own text.
Open Source Code | No | In this paper, we surveyed results in the existing literature, without proposing new methods or models.
Open Datasets | Yes | BIG-Bench: Figure 2A–D depicts four emergent few-shot prompted tasks from BIG-Bench, a crowd-sourced suite of over 200 benchmarks for language model evaluation (BIG-Bench, 2022). TruthfulQA: Figure 2E shows few-shot prompted performance on the TruthfulQA benchmark, which measures the ability to answer questions truthfully (Lin et al., 2021). Multi-task language understanding: Figure 2G shows the Massive Multi-task Language Understanding (MMLU) benchmark, which aggregates 57 tests covering a range of topics including math, history, law, and more (Hendrycks et al., 2021a). Word in Context: Finally, Figure 2H shows the Word in Context (WiC) benchmark (Pilehvar & Camacho-Collados, 2019).
Dataset Splits | No | The paper discusses various tasks and benchmarks (e.g., BIG-Bench, MMLU) and refers to the experimental setups of prior work (e.g., "2-shot" for BIG-Bench tasks), but it does not define any new training/validation/test splits of its own, relying instead on the established setups of the referenced benchmarks.
Hardware Specification | No | The paper analyzes existing research on emergent abilities of large language models. It refers to model scale in terms of 'training FLOPs' and 'model parameters' but does not specify the hardware used to conduct its own analysis or for the experiments discussed from other papers.
Software Dependencies | No | The paper is a survey of existing literature and does not describe a novel system or implementation, so it lists no software dependencies or version requirements.
Experiment Setup | No | The paper is a survey of prior work on emergent abilities in large language models. It discusses various prompting strategies and tasks but does not describe an experimental setup, hyperparameters, or system-level training settings of its own.
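The paper's working definition of emergence quoted above (performance stays near random chance in smaller models, then jumps in larger models in a way that extrapolating the small-scale points would not predict) can be sketched as a toy check over scale–accuracy pairs. This is an illustrative sketch only: the function name, thresholds, and data points below are assumptions for demonstration, not values from the paper.

```python
def is_emergent(scales, accuracies, chance=0.25, margin=0.05):
    """Toy check for the paper's notion of an emergent ability:
    accuracy stays near the random-chance baseline at all smaller
    scales, then clearly exceeds it at the largest scale.
    `chance` and `margin` are illustrative assumptions."""
    assert len(scales) == len(accuracies) >= 3, "need several scales"
    small = accuracies[:-1]   # all but the largest model
    large = accuracies[-1]    # the largest model
    flat_at_small_scale = all(a <= chance + margin for a in small)
    jump_at_large_scale = large >= chance + 2 * margin
    return flat_at_small_scale and jump_at_large_scale

# Hypothetical 4-way multiple-choice task (chance = 0.25),
# scales given as training FLOPs, as in the paper's figures:
flat = is_emergent([1e20, 1e21, 1e22, 1e23], [0.24, 0.26, 0.25, 0.27])
jump = is_emergent([1e20, 1e21, 1e22, 1e23], [0.24, 0.26, 0.27, 0.62])
```

In the first call every model sits within the margin of chance, so no emergence is flagged; in the second, the largest model clearly exceeds chance while the smaller ones do not, matching the "not predictable by extrapolation" pattern the paper describes.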