Emergent Abilities of Large Language Models

Authors: Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, William Fedus

TMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This paper instead discusses an unpredictable phenomenon that we refer to as emergent abilities of large language models. We consider an ability to be emergent if it is not present in smaller models but is present in larger models. Thus, emergent abilities cannot be predicted simply by extrapolating the performance of smaller models. The existence of such emergence raises the question of whether additional scaling could potentially further expand the range of capabilities of language models. We survey emergent abilities as observed in a range of prior work, categorizing them in settings such as few-shot prompting (§3) and augmented prompting strategies (§4).
Researcher Affiliation | Collaboration | Jason Wei (1, EMAIL), Yi Tay (1, EMAIL), Rishi Bommasani (2, EMAIL), Colin Raffel (3, craffel@gmail.com), Barret Zoph (1, EMAIL), Sebastian Borgeaud (4, EMAIL), Dani Yogatama (4, EMAIL), Maarten Bosma (1, EMAIL), Denny Zhou (1, EMAIL), Donald Metzler (1, EMAIL), Ed H. Chi (1, EMAIL), Tatsunori Hashimoto (2, EMAIL), Oriol Vinyals (4, EMAIL), Percy Liang (2, EMAIL), Jeff Dean (1, jeff@google.com), William Fedus (1, EMAIL). Affiliations: 1 Google Research, 2 Stanford University, 3 UNC Chapel Hill, 4 DeepMind.
Pseudocode | No | The paper is a survey and analysis of emergent abilities in large language models. It describes methodologies and observations from other research, but does not present any novel algorithms or pseudocode blocks within its own text.
Open Source Code | No | In this paper, we surveyed results in the existing literature, without proposing new methods or models.
Open Datasets | Yes | BIG-Bench: Figure 2A–D depicts four emergent few-shot prompted tasks from BIG-Bench, a crowd-sourced suite of over 200 benchmarks for language model evaluation (BIG-Bench, 2022). TruthfulQA: Figure 2E shows few-shot prompted performance on the TruthfulQA benchmark, which measures the ability to answer questions truthfully (Lin et al., 2021). Multi-task language understanding: Figure 2G shows the Massive Multi-task Language Understanding (MMLU) benchmark, which aggregates 57 tests covering a range of topics including math, history, law, and more (Hendrycks et al., 2021a). Word in Context: Finally, Figure 2H shows the Word in Context (WiC) benchmark (Pilehvar & Camacho-Collados, 2019).
Dataset Splits | No | The paper discusses various tasks and benchmarks (e.g., BIG-Bench, MMLU) and refers to the experimental setups of prior work (e.g., "2-shot" for BIG-Bench tasks), but it does not define any new training/validation/test splits of its own, relying instead on the established setups of the referenced benchmarks.
Hardware Specification | No | The paper analyzes existing research on emergent abilities of large language models. It refers to model scale in terms of 'training FLOPs' and 'model parameters' but does not specify the hardware used to conduct its own analysis or for the experiments discussed from other papers.
Software Dependencies | No | The paper is a survey of existing literature and does not describe a novel system or implementation, so it lists no software dependencies or version requirements.
Experiment Setup | No | The paper is a survey of prior work on emergent abilities in large language models. It discusses various prompting strategies and tasks but does not describe an experimental setup, hyperparameters, or system-level training settings of its own.
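The paper's working definition of emergence quoted above (performance stays near random chance in smaller models, then jumps in larger models in a way that extrapolating the small-scale points would not predict) can be sketched as a toy check over scale–accuracy pairs. This is an illustrative sketch only: the function name, thresholds, and data points below are assumptions for demonstration, not values from the paper.

```python
def is_emergent(scales, accuracies, chance=0.25, margin=0.05):
    """Toy check for the paper's notion of an emergent ability:
    accuracy stays near the random-chance baseline at all smaller
    scales, then clearly exceeds it at the largest scale.
    `chance` and `margin` are illustrative assumptions."""
    assert len(scales) == len(accuracies) >= 3, "need several scales"
    small = accuracies[:-1]   # all but the largest model
    large = accuracies[-1]    # the largest model
    flat_at_small_scale = all(a <= chance + margin for a in small)
    jump_at_large_scale = large >= chance + 2 * margin
    return flat_at_small_scale and jump_at_large_scale

# Hypothetical 4-way multiple-choice task (chance = 0.25),
# scales given as training FLOPs, as in the paper's figures:
flat = is_emergent([1e20, 1e21, 1e22, 1e23], [0.24, 0.26, 0.25, 0.27])
jump = is_emergent([1e20, 1e21, 1e22, 1e23], [0.24, 0.26, 0.27, 0.62])
```

In the first call every model sits within the margin of chance, so no emergence is flagged; in the second, the largest model clearly exceeds chance while the smaller ones do not, matching the "not predictable by extrapolation" pattern the paper describes.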