Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Reformer: The Efficient Transformer
Authors: Nikita Kitaev, Lukasz Kaiser, Anselm Levskaya
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experiment on a synthetic task, a text task (enwik8) with sequences of length 64K and an image generation task (imagenet-64 generation) with sequences of length 12K. In both cases we show that Reformer matches the results obtained with full Transformer but runs much faster, especially on the text task, and with orders of magnitude better memory efficiency. |
| Researcher Affiliation | Collaboration | Nikita Kitaev U.C. Berkeley & Google Research EMAIL Łukasz Kaiser Google Research EMAIL Anselm Levskaya Google Research |
| Pseudocode | No | The paper does not contain any blocks explicitly labeled as "Pseudocode" or "Algorithm". |
| Open Source Code | Yes | Code for training our models is made publicly available.2 (Footnote 2: https://github.com/google/trax/tree/master/trax/models/reformer) |
| Open Datasets | Yes | We ran our experiments on the imagenet64 and enwik8-64K tasks, where the latter is a variant of enwik8 that is chunked into subsequences of 2^16 = 64K tokens. |
| Dataset Splits | No | The paper mentions training and evaluation on 'held-out data' and 'test set', but does not provide specific percentages or counts for training, validation, and test splits for the datasets used (enwik8, imagenet64, WMT 2014). |
| Hardware Specification | Yes | Training for all experiments was parallelized across 8 devices (8 GPUs or 8 TPU v3 cores). |
| Software Dependencies | No | The paper mentions using the Adafactor optimizer for training but does not provide specific version numbers for it or any other software dependencies. |
| Experiment Setup | Yes | All experiments have dmodel = 1024, dff = 4096, nheads = 8, and a total batch size of 8 sequences. We used the Adafactor optimizer (Shazeer & Stern, 2018) for training these models. We train it for 150K steps in 4 different settings: with full attention, LSH attention with nrounds = 1, nrounds = 2 and nrounds = 4. |
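The experiment setup reported above can be collected into a small configuration sketch. This is a hypothetical illustration only; the key names follow the paper's notation (d_model, d_ff, n_heads) rather than the API of any specific library such as trax.

```python
# Hypothetical config mirroring the hyperparameters quoted from the paper.
reformer_setup = {
    "d_model": 1024,        # model (embedding) dimension
    "d_ff": 4096,           # feed-forward hidden dimension
    "n_heads": 8,           # number of attention heads
    "batch_size": 8,        # total batch size, in sequences
    "optimizer": "Adafactor",
    "train_steps": 150_000, # training steps per setting
}

# The four attention settings compared over 150K steps:
# full attention, and LSH attention with 1, 2, and 4 hashing rounds.
attention_settings = ["full"] + [f"lsh-{n}-rounds" for n in (1, 2, 4)]
```

Training was parallelized across 8 devices (8 GPUs or 8 TPU v3 cores), per the hardware row above.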