Learning the Transformer Kernel

Authors: Sankalan Pal Chowdhury, Adamos Solomou, Kumar Avinava Dubey, Mrinmaya Sachan

TMLR 2022

Reproducibility assessment — each variable lists the result and the supporting LLM response:
Research Type: Experimental. "We experimentally evaluate our models on LRA (tasks with long context), GLUE (tasks with short context) and a synthetic dataset with controllable sparsity, and analyze the performance of our models (Sections 3, 2.2). In our experiments, we find that learnt kernels improve performance in long-context tasks, while staying competitive to the Softmax Transformer of the same size in short-context tasks."
Researcher Affiliation: Collaboration. Sankalan Pal Chowdhury (Department of Computer Science, ETH Zürich); Adamos Solomou (Department of Computer Science, ETH Zürich); Avinava Dubey (Google Research, Mountain View, CA); Mrinmaya Sachan (Department of Computer Science, ETH Zürich).
Pseudocode: No. The paper describes the methodology using mathematical equations and textual descriptions, but does not present any structured pseudocode or algorithm blocks in the main text or appendix.
Open Source Code: Yes. "Our code and models are available at https://github.com/cs1160701/OnLearningTheKernel"
Open Datasets: Yes. "Long Range Arena (LRA; Tay et al. 2021b) is a diverse benchmark for the purpose of evaluating the ability of sequence models to reason under long-context scenarios... We pre-train all models (including Softmax Transformer) on the WikiText-103 dataset (Merity et al., 2016) using non-contextual WordPiece embeddings (Wu et al., 2016). Pre-trained models are then fine-tuned on the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2019)... The gray-scaled CIFAR10 image classification dataset (Krizhevsky, 2009) is used, resulting in a sequence length of 1024."
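The length-1024 figure follows from CIFAR10's standard 32x32 resolution: converting each image to a single grayscale channel and flattening it row-major yields one pixel token per position. A minimal sketch (this mimics, but does not reproduce, the authors' preprocessing; the averaging rule and dummy data are assumptions):

```python
# Dummy 32x32 RGB image as nested lists of (r, g, b) tuples.
image = [[(100, 150, 200) for _ in range(32)] for _ in range(32)]

# Grayscale by channel averaging (one value per pixel; the exact
# grayscale formula used in the paper is not specified).
gray = [[sum(px) / 3 for px in row] for row in image]

# Row-major flatten: 32 * 32 pixels -> a length-1024 token sequence.
sequence = [value for row in gray for value in row]
```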
Dataset Splits: Yes. "Setup: To ensure a fair comparison, we closely follow the same data preprocessing, data split, model size and training procedure as in (Tay et al., 2021b)... Each dataset has 200K instances, of sequence length 200. Of these, we use 80% as the training set and the rest for validation."
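The quoted 80/20 split of 200K instances can be sketched as follows (a hypothetical illustration, not the authors' code; the function name, deterministic shuffle, and seed are assumptions):

```python
import random

def train_val_split(instances, train_frac=0.8, seed=0):
    """Shuffle deterministically, then cut at train_frac."""
    idx = list(range(len(instances)))
    random.Random(seed).shuffle(idx)
    cut = int(len(instances) * train_frac)
    train = [instances[i] for i in idx[:cut]]
    val = [instances[i] for i in idx[cut:]]
    return train, val

# 200K placeholder instances of sequence length 200, as in the paper.
data = [[0] * 200 for _ in range(200_000)]
train, val = train_val_split(data)
# 80% training (160K instances), 20% validation (40K instances).
```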
Hardware Specification: Yes. "In both cases experiments are conducted on 8 NVIDIA TITAN RTX GPUs."
Software Dependencies: No. The paper mentions "Python 3 and PyTorch (Paszke et al., 2019)" but does not specify exact version numbers for these software components (e.g., Python 3.x, PyTorch 1.x).
Experiment Setup: Yes. "We outline the hyperparameters for all tasks in Table 6 in the Appendix. [...] Table 8: Hyperparameters for GLUE tasks. Where multiple parameters were tried, they are listed in curly brackets."