Transformer-VQ: Linear-Time Transformers via Vector Quantization

Authors: Lucas Dax Lingle

Venue: ICLR 2024

Reproducibility Variable | Result | LLM Response
--- | --- | ---
Research Type | Experimental | "In our large-scale experiments, Transformer-VQ is shown highly competitive in quality, obtaining 0.99 bpb on Enwik8, 26.6 ppl on PG-19, and 3.16 bpb on ImageNet64. In addition, the optimized implementation of Transformer-VQ is over 3x faster than a comparable quadratic-time transformer at sequence length 8k, is over 12x faster at 32k, and can scale to 131k with similar throughput."
Researcher Affiliation | Industry | "Lucas D. Lingle, Independent Researcher"
Pseudocode | Yes | See pseudocode in Appendix E: "Code 1: Jax/Flax pseudocode for VQ-Attention." (An illustrative sketch of the VQ-Attention idea appears after this table.)
Open Source Code | Yes | Code available: https://github.com/transformer-vq/transformer_vq
Open Datasets | Yes | "Enwik8 is a byte-level language modeling dataset consisting of 100 million bytes of unprocessed English-language Wikipedia articles (Mahoney, 2011)... Per convention, it is split into train, validation, and test sets of 90 million, 5 million, and 5 million bytes, respectively (Child et al., 2019; Rae et al., 2020)."
Dataset Splits | Yes | "Per convention, it is split into train, validation, and test sets of 90 million, 5 million, and 5 million bytes, respectively (Child et al., 2019; Rae et al., 2020)." (A minimal split sketch appears after this table.)
Hardware Specification | Yes | "For training, we use TPU v3 pod slices (Jouppi et al., 2017). We benchmark on a TPU v3 with 8 cores, using a global batch size of 8 sequences."
Software Dependencies | No | No version numbers are given; the paper states only: "Transformer-VQ is implemented in Jax (Bradbury et al., 2018) and Flax (Heek et al., 2023)."
Experiment Setup | Yes | "C.1 HYPERPARAMETERS: Per-dataset hyperparameters are provided below. Table 10: Hyperparameters."
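
On the Pseudocode row: the paper's actual Jax/Flax pseudocode for VQ-Attention is in its Appendix E and the linked repository. Purely as a rough illustration of why quantizing keys yields linear-time attention, here is a minimal non-causal sketch under our own assumptions: the function name, shapes, and random codebook are ours, and the real model additionally handles causal masking, block-wise caching, multi-head layout, and learning the codebook.

```python
import jax
import jax.numpy as jnp

def vq_attention_sketch(q, k, v, codebook):
    """Minimal, non-causal sketch of the VQ-Attention idea (not the paper's code).

    q, k, v: [T, d] queries/keys/values for a single head.
    codebook: [S, d] codewords, with S much smaller than T.

    Once each key is snapped to its nearest codeword, exp(q . k_t) depends
    only on the codeword index z_t, so the softmax numerator and denominator
    can be accumulated per codeword: O(T*S) work instead of O(T^2).
    """
    # 1) Vector-quantize keys: nearest codeword by squared distance.
    d2 = jnp.sum((k[:, None, :] - codebook[None, :, :]) ** 2, axis=-1)  # [T, S]
    z = jnp.argmin(d2, axis=-1)                                         # [T]

    # 2) Query-codeword logits (max-subtracted for numerical stability;
    #    the per-query constant cancels between numerator and denominator).
    logits = q @ codebook.T                                             # [T, S]
    w = jnp.exp(logits - jnp.max(logits, axis=-1, keepdims=True))       # [T, S]

    # 3) Aggregate values and key counts per codeword (linear in T).
    onehot = jax.nn.one_hot(z, codebook.shape[0], dtype=v.dtype)        # [T, S]
    v_sum = onehot.T @ v                                                # [S, d]
    k_count = jnp.sum(onehot, axis=0)                                   # [S]

    # 4) softmax(Q K_hat^T) V, recovered from per-codeword statistics.
    return (w @ v_sum) / (w @ k_count)[:, None]                         # [T, d]

# Toy usage with random inputs (illustrative only).
rng = jax.random.PRNGKey(0)
kq, kk, kv, kc = jax.random.split(rng, 4)
q = jax.random.normal(kq, (128, 16))
k = jax.random.normal(kk, (128, 16))
v = jax.random.normal(kv, (128, 16))
codebook = jax.random.normal(kc, (8, 16))
out = vq_attention_sketch(q, k, v, codebook)  # [128, 16]
```

A quadratic-time baseline would instead materialize the full [T, T] matrix exp(q @ k.T); here every intermediate is [T, S] or [S, d], which is where the reported 3x/12x speedups at 8k/32k tokens come from.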
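
On the Dataset Splits row: the conventional Enwik8 split is a deterministic byte-level partition, sketched below in plain Python. The filename and variable names are illustrative assumptions, not taken from the Transformer-VQ repository.

```python
# Conventional Enwik8 split: first 90M bytes train, next 5M validation,
# final 5M test (Child et al., 2019; Rae et al., 2020).
# "enwik8" is the raw 100-million-byte file from Mahoney (2011); the path
# and names here are illustrative, not from the paper's code.
with open("enwik8", "rb") as f:
    data = f.read()
assert len(data) == 100_000_000
train, val, test = data[:90_000_000], data[90_000_000:95_000_000], data[95_000_000:]
```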