Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Reformer: The Efficient Transformer
Authors: Nikita Kitaev, Lukasz Kaiser, Anselm Levskaya
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experiment on a synthetic task, a text task (enwik8) with sequences of length 64K and an image generation task (imagenet-64 generation) with sequences of length 12K. In both cases we show that Reformer matches the results obtained with full Transformer but runs much faster, especially on the text task, and with orders of magnitude better memory efficiency. |
| Researcher Affiliation | Collaboration | Nikita Kitaev U.C. Berkeley & Google Research EMAIL Łukasz Kaiser Google Research EMAIL Anselm Levskaya Google Research |
| Pseudocode | No | The paper does not contain any blocks explicitly labeled as "Pseudocode" or "Algorithm". |
| Open Source Code | Yes | Code for training our models is made publicly available.2 (Footnote 2: https://github.com/google/trax/tree/master/trax/models/reformer) |
| Open Datasets | Yes | We ran our experiments on the imagenet64 and enwik8-64K tasks, where the latter is a variant of enwik8 that is chunked into subsequences of 2^16 = 64K tokens. |
| Dataset Splits | No | The paper mentions training and evaluation on 'held-out data' and 'test set', but does not provide specific percentages or counts for training, validation, and test splits for the datasets used (enwik8, imagenet64, WMT 2014). |
| Hardware Specification | Yes | Training for all experiments was parallelized across 8 devices (8 GPUs or 8 TPU v3 cores). |
| Software Dependencies | No | The paper mentions using the Adafactor optimizer for training but does not provide specific version numbers for it or any other software dependencies. |
| Experiment Setup | Yes | All experiments have dmodel = 1024, dff = 4096, nheads = 8, and a total batch size of 8 sequences. We used the Adafactor optimizer (Shazeer & Stern, 2018) for training these models. We train it for 150K steps in 4 different settings: with full attention, LSH attention with nrounds = 1, nrounds = 2 and nrounds = 4. |
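The experiment setup reported above can be collected into a small configuration sketch. This is a hypothetical illustration only; the key names follow the paper's notation (d_model, d_ff, n_heads) rather than the API of any specific library such as trax.

```python
# Hypothetical config mirroring the hyperparameters quoted from the paper.
reformer_setup = {
    "d_model": 1024,        # model (embedding) dimension
    "d_ff": 4096,           # feed-forward hidden dimension
    "n_heads": 8,           # number of attention heads
    "batch_size": 8,        # total batch size, in sequences
    "optimizer": "Adafactor",
    "train_steps": 150_000, # training steps per setting
}

# The four attention settings compared over 150K steps:
# full attention, and LSH attention with 1, 2, and 4 hashing rounds.
attention_settings = ["full"] + [f"lsh-{n}-rounds" for n in (1, 2, 4)]
```

Training was parallelized across 8 devices (8 GPUs or 8 TPU v3 cores), per the hardware row above.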