Linearized Relative Positional Encoding

Authors: Zhen Qin, Weixuan Sun, Kaiyue Lu, Hui Deng, Dongxu Li, Xiaodong Han, Yuchao Dai, Lingpeng Kong, Yiran Zhong

TMLR 2023

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Experiments show that compared with existing methods, LRPE achieves state-of-the-art performance in language modeling, text classification, and image classification." Table 1: Quantitative results of the RoBERTa model fine-tuned on the GLUE dataset. |
| Researcher Affiliation | Collaboration | 1. Shanghai AI Laboratory; 2. OpenNLPLab; 3. Australian National University; 4. Northwestern Polytechnical University; 5. The University of Hong Kong |
| Pseudocode | Yes | "D.2 Pseudocode: In this section, we provide pseudocode for LRPE in Python." |
| Open Source Code | No | No explicit statement about the release of source code, and no repository link for the described methodology, was found. Section D.2 provides pseudocode for illustration, but not the full implementation. |
| Open Datasets | Yes | "We use Wikitext-103 (Merity et al., 2016), Books (Zhu et al., 2015), and Wiki-Book (Wettig et al., 2022) datasets for NLP task evaluation and ImageNet-1k (Deng et al., 2009) for image classification evaluation." The model is "pretrained and then fine-tuned on several downstream tasks from the GLUE benchmark (Wang et al., 2018)"; long-sequence evaluation is "conducted … on the Long-Range Arena benchmark (Tay et al., 2020)." |
| Dataset Splits | Yes | Same evidence as Open Datasets: Wikitext-103, Books, Wiki-Book, and ImageNet-1k for pretraining/evaluation, with fine-tuning on GLUE downstream tasks and experiments on the Long-Range Arena benchmark. |
| Hardware Specification | Yes | "Our experiments are implemented in the Fairseq framework (Ott et al., 2019) and trained with V100 GPUs." |
| Software Dependencies | No | Fairseq (Ott et al., 2019) is mentioned, but no specific version numbers are given. |
| Experiment Setup | Yes | "Table 8: Detailed configurations used in our experiments. Total batch size means batch_per_gpu × update_freq × num_gpus. Attention dropout is only used for vanilla attention. ALM: autoregressive language model; BLM: bidirectional language model; IM: image modeling." "Table 9: Detailed configurations used in LRA experiments. BN stands for batch normalization. All methods use the same configuration, except for relative positional encodings." |
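The Pseudocode row notes that the paper's Appendix D.2 gives Python pseudocode for LRPE. As a hedged illustration only (not the authors' code), the sketch below implements the rotary/unitary member of the LRPE family: each position t is encoded by a diagonal unitary map Λ^t (a per-channel-pair rotation), so the encoded dot product ⟨Λ^t q, Λ^s k⟩ depends only on the relative offset t − s. The function name `rotary_unitary` and the frequency vector `theta` are illustrative choices, not names from the paper.

```python
import numpy as np

def rotary_unitary(x, theta):
    """Apply a per-position unitary map Lambda^t to each row of x.

    This is the rotary instance of the LRPE family: at position t, pairs of
    channels are rotated by angle t * theta[j]. x has shape (seq_len, d) with
    d even; theta has shape (d // 2,).
    """
    seq_len, d = x.shape
    ang = np.arange(seq_len)[:, None] * theta[None, :]   # (seq_len, d // 2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                   # 2x2 rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Demonstrate the relative-position property: with the same query/key vector
# repeated at every position, the encoded score depends only on t - s.
rng = np.random.default_rng(0)
d = 8
theta = 1.0 / (10000 ** (np.arange(d // 2) / (d // 2)))  # illustrative frequencies
q = np.tile(rng.standard_normal(d), (6, 1))
k = np.tile(rng.standard_normal(d), (6, 1))
qe, ke = rotary_unitary(q, theta), rotary_unitary(k, theta)
score_21 = qe[2] @ ke[1]   # offset t - s = 1
score_54 = qe[5] @ ke[4]   # offset t - s = 1
assert np.allclose(score_21, score_54)
```

Because the positional map is applied to queries and keys separately (rather than to the attention matrix), it composes with linear attention kernels, which is the decomposability that LRPE formalizes.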