Inverse Scaling: When Bigger Isn't Better
Authors: Ian R. McKenzie, Alexander Lyzhov, Michael Martin Pieler, Alicia Parrish, Aaron Mueller, Ameya Prabhu, Euan McLean, Xudong Shen, Joe Cavanagh, Andrew George Gritsevskiy, Derik Kauffman, Aaron T. Kirtland, Zhengping Zhou, Yuhui Zhang, Sicong Huang, Daniel Wurgaft, Max Weiss, Alexis Ross, Gabriel Recchia, Alisa Liu, Jiacheng Liu, Tom Tseng, Tomasz Korbak, Najoung Kim, Samuel R. Bowman, Ethan Perez
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present empirical evidence of inverse scaling on 11 datasets collected by running a public contest, the Inverse Scaling Prize, with a substantial prize pool. |
| Researcher Affiliation | Collaboration | Ian R. McKenzie (FAR AI, New York University); Alexander Lyzhov (New York University); Michael Pieler (Stability AI); Alicia Parrish (New York University, Google); Aaron Mueller (Johns Hopkins University, New York University); Ameya Prabhu (Oxford University); Euan McLean (FAR AI); winning task authors: Aaron Kirtland, Alexis Ross, Alisa Liu, Andrew Gritsevskiy, Daniel Wurgaft, Derik Kauffman, Gabriel Recchia, Jiacheng Liu, Joe Cavanagh, Max Weiss, Sicong Huang, The Floating Droid, Tom Tseng, Tomasz Korbak, Xudong Shen, Yuhui Zhang, Zhengping Zhou; Najoung Kim (Boston University, Google); Samuel R. Bowman (New York University, Anthropic); Ethan Perez (FAR AI, New York University, Anthropic) |
| Pseudocode | No | The paper describes the tasks and their scaling behavior but does not present any structured pseudocode or algorithm blocks for its own methodology. |
| Open Source Code | Yes | We release the data at https://inversescaling.com/data under a CC BY 4.0 license. [...] See https://github.com/inverse-scaling/prize for all evaluations, including performance in the few-shot setting and performance through training. |
| Open Datasets | Yes | We release the data at https://inversescaling.com/data under a CC BY 4.0 license. |
| Dataset Splits | No | We evaluated submissions in zero-shot (no examples provided in the input) and few-shot (a few examples provided) settings across model series from OpenAI, Anthropic, and DeepMind, covering over 5 orders of magnitude: 10^18 to 10^23 training FLOPs. [...] We required at least 300 examples per task, and we recommended aiming for around 1000 examples for a clearer demonstration of scaling trends. |
| Hardware Specification | No | The paper discusses various large language models (e.g., GPT-3, Gopher, Chinchilla) and their training FLOPs, but does not specify the hardware used by the authors to conduct their evaluations or experiments. |
| Software Dependencies | No | We offered Google Colab notebooks for evaluating inverse scaling with the GPT-3 (Brown et al., 2020), GPT-2 (Radford et al., 2019), and OPT (up to 13B with Colab Pro+; Zhang et al., 2022) model series when developing a task. To query the GPT-3 models, participants had to use credits for the OpenAI API. [...] We would also like to thank Scott Heiner, Edwin Chen, and others from Surge AI for organizing human validation and offering support to participants, and Jason Phang, Stella Biderman, and Hugging Face for their help running evaluations on large public models. |
| Experiment Setup | Yes | Participants submitted a dataset of input-output examples in the form of a text completion task. Along with the dataset, participants submitted justification for the importance of the task and scaling plots on GPT-3 models (Brown et al., 2020). [...] We evaluated submissions in zero-shot (no examples provided in the input) and few-shot (a few examples provided) settings... We required at least 300 examples per task... Winning submissions used one of the following two evaluation metrics: Classification Loss... Loss on a sequence at the end of a prompt (sequence prob). |
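The classification-loss metric mentioned in the experiment setup can be sketched as below. This is a minimal illustration, not the contest's evaluation code: `option_logprobs` stands in for the model's summed token log-probabilities over each answer option (the values here are made up), and the loss is the cross-entropy after renormalizing over the options.

```python
import math

def classification_loss(option_logprobs, correct_index):
    """Cross-entropy over answer options, with the model's per-option
    sequence log-probabilities renormalized to sum to 1 (softmax)."""
    m = max(option_logprobs)  # subtract max for numerical stability
    exps = [math.exp(lp - m) for lp in option_logprobs]
    probs = [e / sum(exps) for e in exps]
    return -math.log(probs[correct_index])

# Hypothetical two-option task where the model prefers the wrong answer:
# the correct option (index 0) gets lower log-probability, so the loss
# is high. Under inverse scaling, this loss would grow with model size.
logprobs = [-4.2, -1.3]
loss = classification_loss(logprobs, correct_index=0)
```

Accuracy on such a task is simply whether `argmax(option_logprobs)` equals `correct_index`; the sequence-prob metric instead scores the loss of a single target completion without renormalizing over alternatives.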