AAN+: Generalized Average Attention Network for Accelerating Neural Transformer
Authors: Biao Zhang, Deyi Xiong, Yubin Ge, Junfeng Yao, Hao Yue, Jinsong Su
JAIR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply AAN+ as a drop-in replacement for the decoder self-attention and conduct experiments on machine translation (with diverse language pairs), table-to-text generation, and document summarization. With masking tricks and dynamic programming, AAN+ enables the Transformer to decode sentences around 20% faster without largely compromising the training speed or the generation performance. |
| Researcher Affiliation | Academia | Biao Zhang: School of Informatics, Xiamen University, Xiamen 361005, China; School of Informatics, University of Edinburgh, Edinburgh EH8 9AB, United Kingdom. Deyi Xiong: College of Intelligence and Computing, Tianjin University, Tianjin 300350, China. Yubin Ge: University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA. Junfeng Yao: School of Film, Xiamen University, Xiamen 361005, China. Hao Yue: School of Informatics, Xiamen University, Xiamen 361005, China. Jinsong Su (corresponding author): School of Informatics, Xiamen University; Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan, Ministry of Culture and Tourism, Xiamen 361005, China. |
| Pseudocode | No | The paper describes methods using mathematical formulations (e.g., Equation 4, 5, 6) and descriptive text, but does not contain explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about open-sourcing the code for the methodology described, nor does it include a link to a code repository. It mentions that "Our proposed model has been applied to several online translation services and has been adopted in Marian (Junczys-Dowmunt et al., 2018)", but this does not constitute a release of the specific implementation presented in the paper. |
| Open Datasets | Yes | We choose three translation tasks, including WMT14 English-German translation (En-De; Bojar et al., 2014), WMT14 English-French translation (En-Fr), and NIST Chinese-English translation (Zh-En; Zhang et al., 2020). ... We adopt the WIKIBIO dataset (Lebret et al., 2016) for table-to-text generation. ... We use the CNN/Daily Mail dataset (Hermann et al., 2015) for document summarization. |
| Dataset Splits | Yes | We use newstest2013, newstest2012+2013 and NIST2005 as the development set for WMT14 En-De, WMT14 En-Fr and NIST Zh-En, respectively. We report test results on newstest2014 for WMT14 En-De and WMT14 En-Fr, and NIST2002, NIST2003, NIST2004, NIST2006, NIST2008 for NIST Zh-En. ... We obtain 582,659 articles for training, 72,831 articles for validation, and another 72,831 articles for testing. ... We have 287,227 document-summary pairs in the training set, 13,368 pairs and 11,490 pairs in the validation set and test set, respectively. |
| Hardware Specification | Yes | Unless otherwise specified, all experiments are performed with a GeForce GTX 1080 Ti and all models are implemented in TensorFlow. ... The time for Big Transformer is measured on a GeForce GTX 1080. ... Our CPU is an Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz supporting 32 processors. |
| Software Dependencies | Yes | We employ tokenized case-sensitive BLEU (Papineni et al., 2002) calculated using multi-bleu.perl as well as METEOR (Denkowski & Lavie, 2014) to evaluate translation quality. We used the METEOR library at https://www.cs.cmu.edu/~alavie/METEOR/download/meteor-1.5.tar.gz. ... Following previous work, we use NIST mteval-v13a.pl for BLEU and MSR rouge-1.5.5 for ROUGE. |
| Experiment Setup | Yes | By default, we adopt the base setting (Vaswani et al., 2017): the model size d is 512, the middle layer size of FFN(·) is 2048, and the head number is 8. We use Adam (Kingma & Ba, 2015) (β1 = 0.9, β2 = 0.98, and ϵ = 10⁻⁸) to train model parameters, with a batch size of roughly 25000 target tokens. We schedule the learning rate using the inverse square root of running steps, with a warmup step of 4000. To avoid over-fitting, we employ label smoothing with ϵls = 0.1, attention dropout and residual dropout with a rate of p = 0.1. We use the beam search algorithm for decoding and set the beam size to 4 and the length penalty to 0.6. We set α, β and γ in AAN+ to 0.1 according to our preliminary experiments. |
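The decoding speedup quoted in the Research Type row (masking tricks during training, dynamic programming during decoding) rests on the average attention idea behind AAN: decoder self-attention is replaced by a cumulative average over the prefix, which at inference time can be maintained as a running sum updated in constant time per step. A minimal NumPy sketch of just that cumulative-average core, with names of our own choosing and the full model's gating and FFN layers omitted:

```python
import numpy as np

def aan_cumulative_average(embeddings):
    """Training-time computation: for each position t, the mean of
    embeddings 1..t, computed for all positions at once via cumsum
    (equivalent to multiplying by a lower-triangular averaging mask)."""
    # embeddings: (seq_len, d_model)
    seq_len = embeddings.shape[0]
    cumsum = np.cumsum(embeddings, axis=0)
    counts = np.arange(1, seq_len + 1)[:, None]
    return cumsum / counts

class AANDecodeState:
    """Incremental decoding: a running sum makes each step O(d_model),
    independent of the number of previously generated tokens."""
    def __init__(self, d_model):
        self.running_sum = np.zeros(d_model)
        self.t = 0

    def step(self, new_embedding):
        self.t += 1
        self.running_sum = self.running_sum + new_embedding
        return self.running_sum / self.t
```

The incremental path produces exactly the same outputs as the batched training-time computation, which is why the model can be trained in parallel yet decoded with constant per-step cost.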
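The learning-rate schedule named in the Experiment Setup row (inverse square root of the running step with a 4000-step warmup) is the standard Transformer schedule from Vaswani et al. (2017). An illustrative implementation, assuming the usual d_model⁻⁰·⁵ scaling factor:

```python
def transformer_lr(step, d_model=512, warmup=4000):
    """Inverse-square-root decay with linear warmup (Vaswani et al., 2017):
    lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
    step = max(step, 1)  # guard against step 0
    return (d_model ** -0.5) * min(step ** -0.5, step * warmup ** -1.5)
```

The two branches of the `min` cross exactly at `step == warmup`, so the rate rises linearly for the first 4000 steps and then decays as the inverse square root of the step count.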