Training LLMs over Neurally Compressed Text
Authors: Brian Lester, Jaehoon Lee, Alexander A. Alemi, Jeffrey Pennington, Adam Roberts, Jascha Sohl-Dickstein, Noah Constant
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we explore the idea of training large language models (LLMs) over highly compressed text... we demonstrate effective learning over neurally compressed text that improves with scale, and outperforms byte-level baselines by a wide margin on perplexity and inference speed benchmarks. While our method delivers worse perplexity than subword tokenizers for models trained with the same parameter count, it has the benefit of shorter sequence lengths. |
| Researcher Affiliation | Industry | Google DeepMind; Anthropic |
| Pseudocode | No | The paper describes algorithms and processes, such as Arithmetic Coding and Equal-Info Windows, in detail using prose and mathematical notation, but it does not include any clearly labeled pseudocode blocks or algorithms in a structured, code-like format. |
| Open Source Code | No | The paper mentions the use of existing libraries like TensorFlow Compression (Ballé et al., 2024) and the Python zlib library, but it does not state that the authors are releasing their own implementation code for the methodology described in this paper. |
| Open Datasets | Yes | All training data used is English web text from C4 (en 3.1.0) (Raffel et al., 2020). |
| Dataset Splits | Yes | M1 and M2 are both trained on the C4 training data, but the final validation data used to evaluate M2 is unseen during M1 training, therefore there is no information leakage. This is similar to how LLM tokenizers are often trained on the same dataset that the LLM is subsequently trained on. |
| Hardware Specification | No | The paper mentions 'running on parallel hardware' and 'TPU' (in the context of numerical noise in LLM inference), but it does not specify any exact GPU or CPU models, processor types, or detailed computer specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions several software components: the Python zlib library (Van Rossum & Drake, 2009), TensorFlow Compression (Ballé et al., 2024), JAX (Bradbury et al., 2018), Flax (Heek et al., 2020), T5X (Roberts et al., 2023), Matplotlib (Hunter, 2007), Seaborn (Waskom, 2021), and SciPy (Virtanen et al., 2020). However, specific version numbers for these libraries or programming languages are not provided. |
| Experiment Setup | Yes | M1 training: The model used for compression is a decoder-only Transformer model... uses the 3m size seen in Table 4 and a context length of 1,024. We use a batch size of 128, an rsqrt decay learning rate schedule (1/√steps) starting at 1.0 with 10,000 warmup steps, and a z-loss of 0.0001. The model is trained for 2,500,000 steps using the Adafactor (Shazeer & Stern, 2018) optimizer. M2 training: Each M2 model is trained for 200,000 steps with a batch size of 256 and a sequence length of 512. All other hyperparameters match those used in M1. Table 4 provides specific details for model sizes including Embedding Dim, #Heads, #Layers, Head Dim, and MLP Dim. |
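The rsqrt decay schedule cited in the experiment setup (1/√steps, starting at 1.0 with 10,000 warmup steps) can be sketched as follows. This is a minimal sketch assuming the T5X convention of holding the rate at its warmup-step value during warmup; the function name and the constant-during-warmup behavior are assumptions, not details confirmed by the paper:

```python
import math

def rsqrt_schedule(step: int, base_lr: float = 1.0, warmup_steps: int = 10_000) -> float:
    """Inverse-square-root decay with a warmup floor.

    For step <= warmup_steps the rate is held constant at
    base_lr / sqrt(warmup_steps); afterwards it decays as base_lr / sqrt(step).
    """
    return base_lr / math.sqrt(max(step, warmup_steps))
```

Under this convention the peak learning rate is 1/√10,000 = 0.01, decaying to 0.005 by step 40,000.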
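The paper's pipeline feeds M2 a compressed bitstream chunked into fixed-width tokens. As a rough illustration of that tokenization step only, the sketch below chunks a zlib-compressed byte stream (zlib is one of the paper's baseline compressors, standing in here for the M1 arithmetic coder) into 16-bit tokens; the helper name and zero-padding scheme are hypothetical:

```python
import zlib

def bits_to_tokens(data: bytes, bits_per_token: int = 16) -> list[int]:
    """Compress a byte string and chunk the resulting bitstream into
    fixed-width integer tokens suitable for an LLM vocabulary of size
    2**bits_per_token."""
    compressed = zlib.compress(data)
    # Render the compressed bytes as a single bitstring.
    bitstring = "".join(f"{b:08b}" for b in compressed)
    # Zero-pad so the length is a multiple of bits_per_token.
    bitstring += "0" * ((-len(bitstring)) % bits_per_token)
    return [
        int(bitstring[i : i + bits_per_token], 2)
        for i in range(0, len(bitstring), bits_per_token)
    ]
```

With 16-bit tokens this yields a vocabulary of 65,536 entries; the paper's Equal-Info Windows method additionally resets the compressor every fixed number of bits so each window decodes independently, which this sketch does not reproduce.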