Inverse Scaling: When Bigger Isn't Better
Authors: Ian R. McKenzie, Alexander Lyzhov, Michael Martin Pieler, Alicia Parrish, Aaron Mueller, Ameya Prabhu, Euan McLean, Xudong Shen, Joe Cavanagh, Andrew George Gritsevskiy, Derik Kauffman, Aaron T. Kirtland, Zhengping Zhou, Yuhui Zhang, Sicong Huang, Daniel Wurgaft, Max Weiss, Alexis Ross, Gabriel Recchia, Alisa Liu, Jiacheng Liu, Tom Tseng, Tomasz Korbak, Najoung Kim, Samuel R. Bowman, Ethan Perez
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present empirical evidence of inverse scaling on 11 datasets collected by running a public contest, the Inverse Scaling Prize, with a substantial prize pool. |
| Researcher Affiliation | Collaboration | Ian R. McKenzie (FAR AI, New York University); Alexander Lyzhov (New York University); Michael Pieler (Stability AI); Alicia Parrish (New York University, Google); Aaron Mueller (Johns Hopkins University, New York University); Ameya Prabhu (Oxford University); Euan McLean (FAR AI); winning task authors: Aaron Kirtland, Alexis Ross, Alisa Liu, Andrew Gritsevskiy, Daniel Wurgaft, Derik Kauffman, Gabriel Recchia, Jiacheng Liu, Joe Cavanagh, Max Weiss, Sicong Huang, The Floating Droid, Tom Tseng, Tomasz Korbak, Xudong Shen, Yuhui Zhang, Zhengping Zhou; Najoung Kim (Boston University, Google); Samuel R. Bowman (New York University, Anthropic); Ethan Perez (FAR AI, New York University, Anthropic) |
| Pseudocode | No | The paper describes the tasks and their scaling behavior but does not present any structured pseudocode or algorithm blocks for its own methodology. |
| Open Source Code | Yes | We release the data at https://inversescaling.com/data under a CC BY 4.0 license. [...] See https://github.com/inverse-scaling/prize for all evaluations, including performance in the few-shot setting and performance through training. |
| Open Datasets | Yes | We release the data at https://inversescaling.com/data under a CC BY 4.0 license. |
| Dataset Splits | No | We evaluated submissions in zero-shot (no examples provided in the input) and few-shot (a few examples provided) settings across model series from OpenAI, Anthropic, and DeepMind, covering over 5 orders of magnitude: 10^18 to 10^23 training FLOPs. [...] We required at least 300 examples per task, and we recommended aiming for around 1000 examples for a clearer demonstration of scaling trends. |
| Hardware Specification | No | The paper discusses various large language models (e.g., GPT-3, Gopher, Chinchilla) and their training FLOPs, but does not specify the hardware used by the authors to conduct their evaluations or experiments. |
| Software Dependencies | No | We offered Google Colab notebooks for evaluating inverse scaling with the GPT-3 (Brown et al., 2020), GPT-2 (Radford et al., 2019), and OPT (up to 13B with Colab Pro+; Zhang et al., 2022) model series when developing a task. To query the GPT-3 models, participants had to use credits for the OpenAI API. [...] We would also like to thank Scott Heiner, Edwin Chen, and others from Surge AI for organizing human validation and offering support to participants, and Jason Phang, Stella Biderman, and Hugging Face for their help running evaluations on large public models. |
| Experiment Setup | Yes | Participants submitted a dataset of input-output examples in the form of a text completion task. Along with the dataset, participants submitted justification for the importance of the task and scaling plots on GPT-3 models (Brown et al., 2020). [...] We evaluated submissions in zero-shot (no examples provided in the input) and few-shot (a few examples provided) settings... We required at least 300 examples per task... Winning submissions used one of the following two evaluation metrics: Classification Loss... Loss on a sequence at the end of a prompt (sequence prob). |
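The classification-loss metric mentioned in the experiment setup can be sketched as below. This is a minimal illustration, not the contest's evaluation code: `option_logprobs` stands in for the model's summed token log-probabilities over each answer option (the values here are made up), and the loss is the cross-entropy after renormalizing over the options.

```python
import math

def classification_loss(option_logprobs, correct_index):
    """Cross-entropy over answer options, with the model's per-option
    sequence log-probabilities renormalized to sum to 1 (softmax)."""
    m = max(option_logprobs)  # subtract max for numerical stability
    exps = [math.exp(lp - m) for lp in option_logprobs]
    probs = [e / sum(exps) for e in exps]
    return -math.log(probs[correct_index])

# Hypothetical two-option task where the model prefers the wrong answer:
# the correct option (index 0) gets lower log-probability, so the loss
# is high. Under inverse scaling, this loss would grow with model size.
logprobs = [-4.2, -1.3]
loss = classification_loss(logprobs, correct_index=0)
```

Accuracy on such a task is simply whether `argmax(option_logprobs)` equals `correct_index`; the sequence-prob metric instead scores the loss of a single target completion without renormalizing over alternatives.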