Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Think before you speak: Training Language Models With Pause Tokens
Authors: Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, Vaishnavh Nagarajan
ICLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically evaluate pause-training on decoder-only models of 1B and 130M parameters with causal pretraining on C4, and on downstream tasks covering reasoning, question-answering, general understanding and fact recall. Our main finding is that inference-time delays show gains on our tasks when the model is both pretrained and finetuned with delays. For the 1B model, we witness gains on eight tasks, most prominently, a gain of 18% EM score on the QA task of SQuAD, 8% on CommonsenseQA and 1% accuracy on the reasoning task of GSM8k. |
| Researcher Affiliation | Collaboration | Sachin Goyal, Machine Learning Department, Carnegie Mellon University; Ziwei Ji, Google Research, NY; Ankit Singh Rawat, Google Research, NY; Aditya Krishna Menon, Google Research, NY; Sanjiv Kumar, Google Research, NY; Vaishnavh Nagarajan, Google Research, NY. Work done in part as a Student Researcher at Google. |
| Pseudocode | Yes | Algorithm 1: Pause-pretraining; Algorithm 2: Pause-finetuning (Stage 2: Finetuning with Pause); Algorithm 3: Pause-inference (Stage 3: Inference with Pause) |
| Open Source Code | No | The paper does not contain an explicit statement or link providing access to the source code for the methodology described in the paper. |
| Open Datasets | Yes | Both the standard and pause models are pretrained on the C4 English mixture (Raffel et al., 2020), using the causal next token prediction objective for a total of 200B tokens (slightly more than 1 epoch on C4). We consider nine varied downstream tasks: (a) reasoning (GSM8k (Cobbe et al., 2021)), (b) extractive question answering (SQuAD (Rajpurkar et al., 2016), CoQA (Reddy et al., 2019)), (c) general understanding (CommonsenseQA (Talmor et al., 2019), PhysicalIQA (Bisk et al., 2020)), (d) long term context recall (LAMBADA (Paperno et al., 2016)), (e) natural language inference (HellaSwag (Zellers et al., 2019)), and (f) fact recall (WebQuestions (Berant et al., 2013), Natural Questions (Kwiatkowski et al., 2019)). |
| Dataset Splits | No | The paper mentions 'For all the downstream finetuning experiments, we report mean and standard deviation over 5 runs (with the randomness purely from the finetuning stage)' and 'We tune the learning rate and batch size', which implies a validation process. However, it does not explicitly provide specific train/validation/test dataset split percentages, absolute sample counts for each split, or references to predefined validation splits with citations. |
| Hardware Specification | No | The paper mentions using 'decoder-only models of size 1B and 130M' but does not provide any specific details about the hardware used to run the experiments, such as GPU models, CPU types, or cloud computing environments with specifications. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software components, libraries, or programming languages used in the experiments. |
| Experiment Setup | Yes | We tune the learning rate and batch size for standard end-to-end training, and use the best hyperparameter for all other training variants as well. We share all the hyperparameters in Appendix H. Table 3: Downstream finetuning hyperparameters for the 1B model. Table 4: Downstream finetuning hyperparameters for the 130M model. |
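The Pseudocode row lists the paper's three stages (pause-pretraining, pause-finetuning, pause-inference). As a rough illustration of the inference stage only, the sketch below shows the control flow of appending M `<pause>` tokens to the prefix before decoding begins, so the model's answer is produced only after the last pause token. `ToyLM`, its `next_token` method, and the token ids are hypothetical stand-ins, not the paper's implementation.

```python
class ToyLM:
    """Hypothetical stand-in for a decoder-only LM: it just echoes the
    last non-pause token in the context, to make the flow runnable."""
    def __init__(self, pause_id):
        self.pause_id = pause_id

    def next_token(self, ids):
        for tok in reversed(ids):
            if tok != self.pause_id:
                return tok
        return self.pause_id


def pause_inference(model, input_ids, pause_id, num_pauses=3, max_new_tokens=2):
    # Stage 3 (pause-inference), sketched: append M copies of <pause>
    # to the input prefix, then decode greedily. Because the pauses are
    # part of the input, generation starts only after the last <pause>.
    ids = list(input_ids) + [pause_id] * num_pauses
    out = []
    for _ in range(max_new_tokens):
        nxt = model.next_token(ids)  # assumed greedy-decoding interface
        ids.append(nxt)
        out.append(nxt)
    return out


PAUSE = -1  # hypothetical <pause> token id
print(pause_inference(ToyLM(PAUSE), [5, 7], PAUSE))  # -> [7, 7]
```

The toy model makes the point structurally: the pause tokens extend the context (buying the model extra forward passes in a real transformer) without themselves being emitted as output.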