Jump Self-attention: Capturing High-order Statistics in Transformers
Authors: Haoyi Zhou, Siyang Xiao, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | With extensive experiments, we empirically show that our methods significantly increase the performance on ten different tasks. |
| Researcher Affiliation | Academia | Haoyi Zhou, Siyang Xiao, Jieqi Peng, Shuai Zhang, and Jianxin Li: BDBC, Beihang University, Beijing, China 100191. Shanghang Zhang: School of Computer Science, Peking University, Beijing, China 100871. |
| Pseudocode | No | The paper describes the proposed method conceptually and mathematically but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/zhouhaoyi/JAT2022. |
| Open Datasets | Yes | We conduct JAT's experiments on language understanding and generalization capabilities on the General Language Understanding Evaluation (GLUE) benchmark [21], a collection of diverse natural language understanding tasks. We also perform additional experiments on the SuperGLUE benchmark [20]. Settings: We use a batch size of 32 and fine-tune for 5 epochs over the data for nine GLUE tasks. The threshold ρ is selected from {4.0, 4.5, …, 10.0}. The other settings follow the recommendations of the original paper. Since the proposed JAT can be used interchangeably with canonical self-attention, we perform a grid search on the layer replacement. There are two sets of layer deployment: the first combination is chosen from {Layers 1–4, Layers 5–8, Layers 9–12} and the alternative is {Layers 1–6, Layers 7–12}. Another important selection is the multi-head grouping; we employ the side-by-side strategy, replacing heads {2, 4, 6, 8, 10} with JAT. We do not use any ensembling strategy or multi-tasking scheme in this fine-tuning. The evaluation is performed on the Dev set. Metric: We use three different evaluation metrics on the nine tasks. |
| Dataset Splits | Yes | The evaluation is performed on the Dev set. |
| Hardware Specification | Yes | Platform: Intel Xeon 3.2 GHz + Nvidia V100 GPU (32 GB) × 4. |
| Software Dependencies | No | The paper mentions software such as BERT, RoBERTa, and MindSpore but does not provide specific version numbers for these or other ancillary software components (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | Settings: We use a batch size of 32 and fine-tune for 5 epochs over the data for nine GLUE tasks. The threshold ρ is selected from {4.0, 4.5, …, 10.0}. |
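The reported grid search, as quoted from the paper, can be enumerated directly. The sketch below is an illustration of the search space only, assuming a 12-layer encoder and a 0.5 step for ρ (consistent with "{4.0, 4.5, …, 10.0}"); the variable names (`rho_grid`, `layer_blocks`, `jat_heads`) are illustrative and do not come from the released JAT code.

```python
# Hedged sketch of the JAT fine-tuning grid search described in the paper.
# Assumptions (not confirmed by the source): a 12-layer encoder, and a
# step of 0.5 for the threshold rho.

# Threshold rho is selected from {4.0, 4.5, ..., 10.0}.
rho_grid = [4.0 + 0.5 * i for i in range(13)]  # 13 values: 4.0 .. 10.0

# Two alternative layer-deployment schemes for replacing canonical
# self-attention with JAT: thirds of the encoder, or halves.
layer_blocks = [
    [range(1, 5), range(5, 9), range(9, 13)],   # Layers 1-4, 5-8, 9-12
    [range(1, 7), range(7, 13)],                # Layers 1-6, 7-12
]

# Side-by-side multi-head grouping: heads 2, 4, 6, 8, 10 use JAT.
jat_heads = [2, 4, 6, 8, 10]

# Enumerate every (layer block, rho) pair in the search space.
configs = [
    {"rho": rho, "layers": list(block), "heads": jat_heads}
    for scheme in layer_blocks
    for block in scheme
    for rho in rho_grid
]

print(len(configs))  # (3 + 2) layer blocks x 13 rho values = 65 configurations
```

Each resulting config would then be fine-tuned for 5 epochs with batch size 32 and scored on the Dev set, per the settings quoted above.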