Scaling Instruction-Finetuned Language Models
Authors: Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, Jason Wei
JMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDi QA, MGSM, open-ended generation, RealToxicityPrompts). |
| Researcher Affiliation | Collaboration | Hyung Won Chung EMAIL Le Hou EMAIL Shayne Longpre EMAIL Barret Zoph EMAIL Yi Tay EMAIL William Fedus EMAIL Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, Jason Wei. At the time of experimentation and writing all authors were affiliated with Google, either as employees or interns. |
| Pseudocode | No | The paper describes methods in prose and through figures like Figure 1 and Figure 3, which illustrate data formats and finetuning procedures, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Public checkpoints: https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints |
| Open Datasets | Yes | Our final set of finetuning tasks is sourced from a combination of tasks from FLAN, T0, Super-NaturalInstructions, along with some dialog, program synthesis, and chain-of-thought reasoning tasks, as described in Figure 2. We provide specific pointers and citations in Table 13. All data sources are publicly available. |
| Dataset Splits | Yes | For MMLU and BBH, we evaluate both the ability to directly predict the answer via direct prompting, where the model directly gives the answer (Brown et al., 2020; Srivastava et al., 2022), as well as via chain-of-thought (CoT) prompting... For all benchmarks we use the given few-shot exemplars, with the number of exemplars following prior work: five-shot for MMLU, three-shot for BBH, one-shot for TyDi QA, and 8-shot for MGSM... The evaluation is performed on 10,000 random samples from the test split of the Civil Comments dataset, whilst the few-shot examples are drawn from the train split in a balanced way (i.e., the same number of toxic and non-toxic samples). |
| Hardware Specification | Yes | For example, we only use 0.2% of the pre-training compute to instruction-finetune Flan-PaLM 540B (approximately 512 v4 TPU chips for 37 hours). We use the JAX-based T5X framework (Bradbury et al., 2018; Roberts et al., 2022). Hardware: TPU v3 or TPU v4 (Jouppi et al., 2020). |
| Software Dependencies | No | We use the JAX-based T5X framework (Bradbury et al., 2018; Roberts et al., 2022). Software: T5X (Roberts et al., 2022), JAX (Bradbury et al., 2018). The paper mentions the software frameworks T5X and JAX and cites papers describing them, but it does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | For each model, we apply the same training procedure, except for a few hyperparameters: learning rate, batch size, dropout, and finetuning steps. We use a constant learning rate schedule and finetune only on the outputs, using the Adafactor optimizer (Shazeer and Stern, 2018). The number of finetuning steps, learning rate, batch size, and dropout for each model are given in Appendix C. (Appendix C, Table 11 provides specific values for Batch size, Dropout, LR, and Steps for each model). |
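The balanced few-shot draw described for the Civil Comments evaluation (equal numbers of toxic and non-toxic train-split exemplars) can be sketched as follows. This is a minimal illustration, not the paper's code: the function name `balanced_few_shot` and the `(text, is_toxic)` example format are assumptions made for the sketch.

```python
import random


def balanced_few_shot(examples, k, seed=0):
    """Draw k few-shot exemplars with equal counts of toxic and
    non-toxic samples, then shuffle their order.

    `examples` is a list of (text, is_toxic) pairs -- a hypothetical
    representation of Civil Comments train-split rows.
    """
    assert k % 2 == 0, "k must be even for a balanced draw"
    rng = random.Random(seed)
    toxic = [e for e in examples if e[1]]
    nontoxic = [e for e in examples if not e[1]]
    shots = rng.sample(toxic, k // 2) + rng.sample(nontoxic, k // 2)
    rng.shuffle(shots)  # avoid all-toxic-first ordering in the prompt
    return shots
```

Fixing the seed makes the exemplar selection reproducible across evaluation runs, which matters when comparing model variants on the same 10,000-sample test draw.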