Training LLMs over Neurally Compressed Text
Authors: Brian Lester, Jaehoon Lee, Alexander A. Alemi, Jeffrey Pennington, Adam Roberts, Jascha Sohl-Dickstein, Noah Constant
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we explore the idea of training large language models (LLMs) over highly compressed text... we demonstrate effective learning over neurally compressed text that improves with scale, and outperforms byte-level baselines by a wide margin on perplexity and inference speed benchmarks. While our method delivers worse perplexity than subword tokenizers for models trained with the same parameter count, it has the benefit of shorter sequence lengths. |
| Researcher Affiliation | Industry | Google DeepMind; Anthropic |
| Pseudocode | No | The paper describes algorithms and processes, such as Arithmetic Coding and Equal-Info Windows, in detail using prose and mathematical notation, but it does not include any clearly labeled pseudocode blocks or algorithms in a structured, code-like format. |
| Open Source Code | No | The paper mentions the use of existing libraries like TensorFlow Compression (Ballé et al., 2024) and the Python zlib library, but it does not state that the authors are releasing their own implementation code for the methodology described in this paper. |
| Open Datasets | Yes | All training data used is English web text from C4 (en 3.1.0) (Raffel et al., 2020). |
| Dataset Splits | Yes | M1 and M2 are both trained on the C4 training data, but the final validation data used to evaluate M2 is unseen during M1 training, therefore there is no information leakage. This is similar to how LLM tokenizers are often trained on the same dataset that the LLM is subsequently trained on. |
| Hardware Specification | No | The paper mentions 'running on parallel hardware' and 'TPU' (in the context of numerical noise in LLM inference), but it does not specify any exact GPU or CPU models, processor types, or detailed computer specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions several software components: the Python zlib library (Van Rossum & Drake, 2009), TensorFlow Compression (Ballé et al., 2024), JAX (Bradbury et al., 2018), Flax (Heek et al., 2020), T5X (Roberts et al., 2023), Matplotlib (Hunter, 2007), Seaborn (Waskom, 2021), and SciPy (Virtanen et al., 2020). However, specific version numbers for these libraries or programming languages are not provided. |
| Experiment Setup | Yes | M1 training: The model used for compression is a decoder-only Transformer model... uses the 3m size seen in Table 4 and a context length of 1,024. We use a batch size of 128, an rsqrt decay learning rate schedule (1/√steps) starting at 1.0 with 10,000 warmup steps, and a z-loss of 0.0001. The model is trained for 2,500,000 steps using the Adafactor (Shazeer & Stern, 2018) optimizer. M2 training: Each M2 model is trained for 200,000 steps with a batch size of 256 and a sequence length of 512. All other hyperparameters match those used in M1. Table 4 provides specific details for model sizes including Embedding Dim, #Heads, #Layers, Head Dim, and MLP Dim. |
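The rsqrt decay schedule cited in the experiment setup (1/√steps, starting at 1.0 with 10,000 warmup steps) can be sketched as follows. This is a minimal sketch assuming the T5X convention of holding the rate at its warmup-step value during warmup; the function name and the constant-during-warmup behavior are assumptions, not details confirmed by the paper:

```python
import math

def rsqrt_schedule(step: int, base_lr: float = 1.0, warmup_steps: int = 10_000) -> float:
    """Inverse-square-root decay with a warmup floor.

    For step <= warmup_steps the rate is held constant at
    base_lr / sqrt(warmup_steps); afterwards it decays as base_lr / sqrt(step).
    """
    return base_lr / math.sqrt(max(step, warmup_steps))
```

Under this convention the peak learning rate is 1/√10,000 = 0.01, decaying to 0.005 by step 40,000.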
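The paper's pipeline feeds M2 a compressed bitstream chunked into fixed-width tokens. As a rough illustration of that tokenization step only, the sketch below chunks a zlib-compressed byte stream (zlib is one of the paper's baseline compressors, standing in here for the M1 arithmetic coder) into 16-bit tokens; the helper name and zero-padding scheme are hypothetical:

```python
import zlib

def bits_to_tokens(data: bytes, bits_per_token: int = 16) -> list[int]:
    """Compress a byte string and chunk the resulting bitstream into
    fixed-width integer tokens suitable for an LLM vocabulary of size
    2**bits_per_token."""
    compressed = zlib.compress(data)
    # Render the compressed bytes as a single bitstring.
    bitstring = "".join(f"{b:08b}" for b in compressed)
    # Zero-pad so the length is a multiple of bits_per_token.
    bitstring += "0" * ((-len(bitstring)) % bits_per_token)
    return [
        int(bitstring[i : i + bits_per_token], 2)
        for i in range(0, len(bitstring), bits_per_token)
    ]
```

With 16-bit tokens this yields a vocabulary of 65,536 entries; the paper's Equal-Info Windows method additionally resets the compressor every fixed number of bits so each window decodes independently, which this sketch does not reproduce.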