Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Think before you speak: Training Language Models With Pause Tokens
Authors: Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, Vaishnavh Nagarajan
ICLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically evaluate pause-training on decoder-only models of 1B and 130M parameters with causal pretraining on C4, and on downstream tasks covering reasoning, question-answering, general understanding and fact recall. Our main finding is that inference-time delays show gains on our tasks when the model is both pretrained and finetuned with delays. For the 1B model, we witness gains on eight tasks, most prominently, a gain of 18% EM score on the QA task of SQuAD, 8% on CommonsenseQA and 1% accuracy on the reasoning task of GSM8k. |
| Researcher Affiliation | Collaboration | Sachin Goyal, Machine Learning Department, Carnegie Mellon University; Ziwei Ji, Google Research, NY; Ankit Singh Rawat, Google Research, NY; Aditya Krishna Menon, Google Research, NY; Sanjiv Kumar, Google Research, NY; Vaishnavh Nagarajan, Google Research, NY. Work done in part as a Student Researcher at Google. |
| Pseudocode | Yes | Algorithm 1: Pause-pretraining; Algorithm 2: Pause-finetuning (Stage 2: Finetuning with Pause); Algorithm 3: Pause-inference (Stage 3: Inference with Pause) |
| Open Source Code | No | The paper does not contain an explicit statement or link providing access to the source code for the methodology described in the paper. |
| Open Datasets | Yes | Both the standard and pause models are pretrained on the C4 English mixture (Raffel et al., 2020), using the causal next token prediction objective for a total of 200B tokens (slightly more than 1 epoch on C4). We consider nine varied downstream tasks: (a) reasoning (GSM8k (Cobbe et al., 2021)), (b) extractive question answering (SQuAD (Rajpurkar et al., 2016), CoQA (Reddy et al., 2019)), (c) general understanding (CommonsenseQA (Talmor et al., 2019), PhysicalIQA (Bisk et al., 2020)), (d) long term context recall (LAMBADA (Paperno et al., 2016)), (e) natural language inference (HellaSwag (Zellers et al., 2019)), and (f) fact recall (WebQuestions (Berant et al., 2013), Natural Questions (Kwiatkowski et al., 2019)). |
| Dataset Splits | No | The paper mentions 'For all the downstream finetuning experiments, we report mean and standard deviation over 5 runs (with the randomness purely from the finetuning stage)' and 'We tune the learning rate and batch size', which implies a validation process. However, it does not explicitly provide specific train/validation/test dataset split percentages, absolute sample counts for each split, or references to predefined validation splits with citations. |
| Hardware Specification | No | The paper mentions using 'decoder-only models of size 1B and 130M' but does not provide any specific details about the hardware used to run the experiments, such as GPU models, CPU types, or cloud computing environments with specifications. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software components, libraries, or programming languages used in the experiments. |
| Experiment Setup | Yes | We tune the learning rate and batch size for standard end-to-end training, and use the best hyperparameter for all other training variants as well. We share all the hyperparameters in Appendix H. Table 3: Downstream finetuning hyperparameters for the 1B model. Table 4: Downstream finetuning hyperparameters for the 130M model. |
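The Pseudocode row lists the paper's three stages (pause-pretraining, pause-finetuning, pause-inference). As a rough illustration of the inference stage only, the sketch below shows the control flow of appending M `<pause>` tokens to the prefix before decoding begins, so the model's answer is produced only after the last pause token. `ToyLM`, its `next_token` method, and the token ids are hypothetical stand-ins, not the paper's implementation.

```python
class ToyLM:
    """Hypothetical stand-in for a decoder-only LM: it just echoes the
    last non-pause token in the context, to make the flow runnable."""
    def __init__(self, pause_id):
        self.pause_id = pause_id

    def next_token(self, ids):
        for tok in reversed(ids):
            if tok != self.pause_id:
                return tok
        return self.pause_id


def pause_inference(model, input_ids, pause_id, num_pauses=3, max_new_tokens=2):
    # Stage 3 (pause-inference), sketched: append M copies of <pause>
    # to the input prefix, then decode greedily. Because the pauses are
    # part of the input, generation starts only after the last <pause>.
    ids = list(input_ids) + [pause_id] * num_pauses
    out = []
    for _ in range(max_new_tokens):
        nxt = model.next_token(ids)  # assumed greedy-decoding interface
        ids.append(nxt)
        out.append(nxt)
    return out


PAUSE = -1  # hypothetical <pause> token id
print(pause_inference(ToyLM(PAUSE), [5, 7], PAUSE))  # -> [7, 7]
```

The toy model makes the point structurally: the pause tokens extend the context (buying the model extra forward passes in a real transformer) without themselves being emitted as output.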