PaLM: Scaling Language Modeling with Pathways

Authors: Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, Noah Fiedel

JMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model (PaLM). We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark.
Researcher Affiliation | Industry | Aakanksha Chowdhery EMAIL Sharan Narang EMAIL Jacob Devlin EMAIL ... Work done while at Google
Pseudocode | No | The paper describes model architectures and modifications (e.g., SwiGLU activation, parallel layers) using mathematical equations and descriptive text, but it does not present any explicitly labeled pseudocode or algorithm blocks. For instance, the parallel formulation is described as 'y = x + MLP(LayerNorm(x)) + Attention(LayerNorm(x))', which is a mathematical expression, not pseudocode.
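The quoted parallel formulation can be sketched in plain Python. This is a hypothetical illustration, not the paper's implementation: `layer_norm`, `parallel_block`, and the toy linear sublayers below are stand-ins that only demonstrate how the parallel form differs from the standard serialized residual block.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the last (feature) axis; scale/bias omitted for brevity.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def parallel_block(x, mlp, attention):
    """Parallel formulation from the paper:
        y = x + MLP(LayerNorm(x)) + Attention(LayerNorm(x))
    versus the standard serialized formulation:
        y = x + MLP(LayerNorm(x + Attention(LayerNorm(x))))
    Both sublayers read the same normalized input, so their matmuls
    can be fused or run concurrently.
    """
    normed = layer_norm(x)
    return x + mlp(normed) + attention(normed)

# Toy stand-ins for the two sublayers (correct shapes only, not real attention).
rng = np.random.default_rng(0)
d = 8
W_mlp = rng.standard_normal((d, d)) * 0.02
W_attn = rng.standard_normal((d, d)) * 0.02
x = rng.standard_normal((4, d))

y = parallel_block(x, lambda h: h @ W_mlp, lambda h: h @ W_attn)
assert y.shape == x.shape
```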
Open Source Code | No | The paper mentions: 'Our training and evaluation codebase is based on JAX (Bradbury et al., 2018) and T5X (Roberts et al., 2022)'. It also states: 'The full evaluation results of PaLM on BIG-bench will be made available there [https://github.com/google/BIG-bench].' However, these refer to tools used and evaluation results, not a release of the source code for the PaLM model or its methodology.
Open Datasets | Yes | The PaLM pretraining dataset consists of a high-quality corpus of 780 billion tokens that represent a wide range of natural language use cases. The dataset is a mixture of filtered webpages, books, Wikipedia, news articles, source code, and social media conversations. This dataset is based on the datasets used to train LaMDA (Thoppilan et al., 2022) and GLaM (Du et al., 2021). ... In this section, we evaluate the PaLM model on the same set of 29 English benchmarks as Du et al. (2021) and Brown et al. (2020). The benchmarks include open-domain closed-book question answering tasks: TriviaQA (Joshi et al., 2017), Natural Questions (Kwiatkowski et al., 2019), WebQuestions (Berant et al., 2013) ... BIG-bench is a collaborative benchmark aimed at producing challenging tasks for large language models (BIG-bench collaboration, 2021).
Dataset Splits | Yes | A sequence length of 2048 was used for all models. Input examples are concatenated together and then split into sequences of exactly 2048 tokens... For all models, we increase the batch size during training. For the largest model, we use batch size 512 (1M tokens) until step 50k, then double it to 1024 (2M tokens) until step 115k, and finally double it again to 2048 (4M tokens) until training is complete at step 255k. The smaller models followed similar schedules. ... The splits for each task are the same ones used in Du et al. (2021) and Brown et al. (2020).
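The quoted batch-size warmup for the largest model is a simple step function of the training step. The helper name below is hypothetical; it only restates the schedule given above (512 sequences until step 50k, 1024 until step 115k, then 2048 until step 255k, at 2048 tokens per sequence):

```python
def palm_540b_batch_size(step):
    """Batch size (in sequences) at a given training step, per the
    schedule quoted above for the largest model. Hypothetical helper."""
    if step < 50_000:
        return 512    # ~1M tokens per batch
    if step < 115_000:
        return 1024   # ~2M tokens per batch
    return 2048       # ~4M tokens per batch, until step 255k

SEQ_LEN = 2048
# Tokens per batch at a few points in training:
assert palm_540b_batch_size(0) * SEQ_LEN == 512 * 2048
assert palm_540b_batch_size(60_000) == 1024
assert palm_540b_batch_size(200_000) == 2048
```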
Hardware Specification | Yes | We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. ... All models are trained on TPU v4 Pods (Jouppi et al., 2020). PaLM 540B is trained over two TPU v4 Pods connected over data center network (DCN) using a combination of model and data parallelism (Xu et al., 2021). We use 3072 TPU v4 chips in each Pod attached to 768 hosts.
Software Dependencies | No | 'Our training and evaluation codebase is based on JAX (Bradbury et al., 2018) and T5X (Roberts et al., 2022) and all models are trained on TPU v4 Pods (Jouppi et al., 2020).' While these software components are mentioned and cited, specific version numbers for JAX and T5X are not provided, which limits exact reproducibility.
Experiment Setup | Yes | Weight initialization: the kernel weights (i.e., everything but the embeddings and layer-norm scales) are initialized with fan-in variance scaling, i.e., W ~ N(0, 1/n_in), where n_in is the input dimension of the kernel. ... Optimizer: the model was trained with the Adafactor optimizer (Shazeer and Stern, 2018), without factorization. ... Optimization hyperparameters: we use an Adafactor learning rate of 10^-2 for the first 10,000 steps, which is then decayed at a rate of 1/sqrt(k), where k is the step number. We train with momentum of β1 = 0.9. The second-order moment interpolation value is computed as β2 = 1.0 - k^(-0.8)... We use global-norm gradient clipping (Pascanu et al., 2012) with a value of 1.0 for all models. We use a dynamic weight decay of lr^2.0 during training... Loss function: the model is trained with the standard language modeling loss function... We additionally use an auxiliary loss of z_loss = 10^-4 * log^2 Z, which encourages the softmax normalizer log Z to be close to 0... Sequence length: a sequence length of 2048 was used for all models. ... Batch size: for all models, we increase the batch size during training. For the largest model, we use batch size 512 (1M tokens) until step 50k, then double it to 1024 (2M tokens) until step 115k, and finally double it again to 2048 (4M tokens) until training is complete at step 255k.
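The per-step hyperparameter rules from the paper can be collected into one function. This is a hypothetical sketch (the helper name is invented), assuming the paper's inverse-square-root learning-rate decay 1/sqrt(k) after step 10,000:

```python
import math

def adafactor_schedule(k):
    """Per-step PaLM hyperparameters as described in the paper (sketch).

    lr:    1e-2 for the first 10,000 steps, then 1/sqrt(k)
    beta2: second-order moment interpolation 1.0 - k**-0.8
    wd:    dynamic weight decay of lr**2.0
    k is the 1-indexed step number.
    """
    lr = 1e-2 if k < 10_000 else 1.0 / math.sqrt(k)
    beta2 = 1.0 - k ** -0.8
    weight_decay = lr ** 2.0
    return lr, beta2, weight_decay
```

Note that the two pieces of the learning-rate schedule meet continuously: at step 10,000, 1/sqrt(10,000) = 10^-2, the same value as the constant warmup phase.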