Learning the Transformer Kernel
Authors: Sankalan Pal Chowdhury, Adamos Solomou, Kumar Avinava Dubey, Mrinmaya Sachan
TMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally evaluate our models on LRA (tasks with long context), GLUE (tasks with short context) and a synthetic dataset with controllable sparsity, and analyze the performance of our models (§3, §2.2). In our experiments, we find that learnt kernels improve performance in long-context tasks, while staying competitive to the Softmax Transformer of the same size in short-context tasks. |
| Researcher Affiliation | Collaboration | Sankalan Pal Chowdhury, Department of Computer Science, ETH Zürich; Adamos Solomou, Department of Computer Science, ETH Zürich; Avinava Dubey, Google Research, Mountain View, CA; Mrinmaya Sachan, Department of Computer Science, ETH Zürich |
| Pseudocode | No | The paper describes the methodology using mathematical equations and textual descriptions, but does not present any structured pseudocode or algorithm blocks in the main text or appendix. |
| Open Source Code | Yes | Our code and models are available at https://github.com/cs1160701/OnLearningTheKernel |
| Open Datasets | Yes | Long Range Arena (LRA; Tay et al. 2021b) is a diverse benchmark for the purpose of evaluating the ability of sequence models to reason under long-context scenarios... We pre-train all models (including Softmax Transformer) on the WikiText-103 dataset (Merity et al., 2016) using non-contextual WordPiece embeddings (Wu et al., 2016). Pre-trained models are then fine-tuned on the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2019)... The gray-scaled CIFAR10 image classification dataset (Krizhevsky, 2009) is used, resulting in a sequence length of 1024. |
| Dataset Splits | Yes | Setup: To ensure a fair comparison, we closely follow the same data preprocessing, data split, model size and training procedure as in (Tay et al., 2021b)... Each dataset has 200K instances, of sequence length 200. Of these, we use 80% as the training set and the rest for validation. |
| Hardware Specification | Yes | In both cases experiments are conducted on 8 NVIDIA TITAN RTX GPUs. |
| Software Dependencies | No | The paper mentions 'Python 3 and PyTorch (Paszke et al., 2019)' but does not specify exact version numbers for these software components (e.g., Python 3.x, PyTorch 1.x). |
| Experiment Setup | Yes | We outline the hyperparameters for all tasks in Table 6 in the Appendix. [...] Table 8: Hyperparameters for GLUE tasks. Where multiple parameters were tried, they are listed in curly brackets. |
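The last row notes that hyperparameter values tried in multiple settings are listed in curly brackets, i.e., the paper's tables encode a sweep grid. As a minimal sketch of how such a grid expands into individual runs (the parameter names and values below are hypothetical illustrations, not taken from the paper's Tables 6 or 8):

```python
from itertools import product

# Hypothetical sweep grid: each key maps to the candidate values that
# a curly-bracketed entry in a hyperparameter table would denote.
grid = {
    "learning_rate": [1e-4, 5e-5],
    "batch_size": [16, 32],
    "epochs": [3],
}

# Expand the grid into one configuration dict per combination,
# mirroring how a sweep over bracketed values would be enumerated.
keys = list(grid)
configs = [dict(zip(keys, values)) for values in product(*grid.values())]

print(len(configs))  # 2 * 2 * 1 = 4 configurations
for cfg in configs:
    print(cfg)
```

Each resulting dict is one fine-tuning configuration; reproducing the reported numbers would mean running every combination and selecting by validation performance.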