Learning the Transformer Kernel
Authors: Sankalan Pal Chowdhury, Adamos Solomou, Kumar Avinava Dubey, Mrinmaya Sachan
TMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally evaluate our models on LRA (tasks with long context), GLUE (tasks with short context) and a synthetic dataset with controllable sparsity, and analyze the performance of our models (§3, §2.2). In our experiments, we find that learnt kernels improve performance in long-context tasks, while staying competitive to the Softmax Transformer of the same size in short-context tasks. |
| Researcher Affiliation | Collaboration | Sankalan Pal Chowdhury, Department of Computer Science, ETH Zürich; Adamos Solomou, Department of Computer Science, ETH Zürich; Avinava Dubey, Google Research, Mountain View, CA; Mrinmaya Sachan, Department of Computer Science, ETH Zürich |
| Pseudocode | No | The paper describes the methodology using mathematical equations and textual descriptions, but does not present any structured pseudocode or algorithm blocks in the main text or appendix. |
| Open Source Code | Yes | Our code and models are available at https://github.com/cs1160701/OnLearningTheKernel |
| Open Datasets | Yes | Long Range Arena (LRA; Tay et al. 2021b) is a diverse benchmark for the purpose of evaluating the ability of sequence models to reason under long-context scenarios... We pre-train all models (including Softmax Transformer) on the WikiText-103 dataset (Merity et al., 2016) using non-contextual WordPiece embeddings (Wu et al., 2016). Pre-trained models are then fine-tuned on the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2019)... The gray-scaled CIFAR10 image classification dataset (Krizhevsky, 2009) is used, resulting in a sequence length of 1024. |
| Dataset Splits | Yes | Setup: To ensure a fair comparison, we closely follow the same data preprocessing, data split, model size and training procedure as in (Tay et al., 2021b)... Each dataset has 200K instances, of sequence length 200. Of these, we use 80% as the training set and the rest for validation. |
| Hardware Specification | Yes | In both cases experiments are conducted on 8 NVIDIA TITAN RTX GPUs. |
| Software Dependencies | No | The paper mentions 'Python 3 and PyTorch (Paszke et al., 2019)' but does not specify exact version numbers for these software components (e.g., Python 3.x, PyTorch 1.x). |
| Experiment Setup | Yes | We outline the hyperparameters for all tasks in Table 6 in the Appendix. [...] Table 8: Hyperparameters for GLUE tasks. Where multiple parameters were tried, they are listed in curly brackets. |
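The last row notes that hyperparameter values tried in multiple settings are listed in curly brackets, i.e., the paper's tables encode a sweep grid. As a minimal sketch of how such a grid expands into individual runs (the parameter names and values below are hypothetical illustrations, not taken from the paper's Tables 6 or 8):

```python
from itertools import product

# Hypothetical sweep grid: each key maps to the candidate values that
# a curly-bracketed entry in a hyperparameter table would denote.
grid = {
    "learning_rate": [1e-4, 5e-5],
    "batch_size": [16, 32],
    "epochs": [3],
}

# Expand the grid into one configuration dict per combination,
# mirroring how a sweep over bracketed values would be enumerated.
keys = list(grid)
configs = [dict(zip(keys, values)) for values in product(*grid.values())]

print(len(configs))  # 2 * 2 * 1 = 4 configurations
for cfg in configs:
    print(cfg)
```

Each resulting dict is one fine-tuning configuration; reproducing the reported numbers would mean running every combination and selecting by validation performance.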