Transformer Architecture Search for Improving Out-of-Domain Generalization in Machine Translation
Authors: Yiheng He, Ruiyi Zhang, Sai Ashish Somayajula, Pengtao Xie
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluated across multiple datasets, our method demonstrates strong OOD generalization, surpassing state-of-the-art approaches, the vanilla Transformer architecture, and prior NAS-based Transformer architectures. For instance, our approach improved the baseline method's BLEU score by 29% on the Gnome (English-Igbo) dataset and by 38% on the Ubuntu (English-Igbo) dataset. We conducted comprehensive experiments across both OOD and in-domain MT tasks, and extensive ablation studies further validate the importance of each component in our framework. |
| Researcher Affiliation | Academia | Yiheng He (UC San Diego), Ruiyi Zhang (UC San Diego), Sai Ashish Somayajula (UC San Diego), Pengtao Xie (UC San Diego) |
| Pseudocode | Yes | Algorithm 1 (Optimization Algorithm): Initialize a model with weights W and architecture A. While not converged: (1) update weights W with equation (7); (2) update perturbation δ with equation (8); (3) update architecture A with equation (9). Finally, derive the final architecture based on the learned A. |
| Open Source Code | Yes | Our code is publicly available at https://github.com/yihenghe/transformer_nas. |
| Open Datasets | Yes | We evaluate OOD generalization performance on low-resource languages, including the English-Igbo (En-Ig), English-Hausa (En-Ha), and English-Irish (En-Ga) language pairs. Following Ahia et al. (2021), we obtain training data for En-Ig, En-Ha, and En-Ga from the CCMatrix parallel corpus (Schwenk et al., 2019), which offers the largest collection of high-quality, web-based bitexts for machine translation. The test data for En-Ig and En-Ha is from the Gnome and Ubuntu datasets, and the test data for En-Ga is from the Flores dataset, all considered OOD for CCMatrix. We include a detailed description of these datasets in Appendix D. Furthermore, we conduct experiments on high-resource languages, following So et al. (2019) and Zhao et al. (2021), using the following training datasets: 1) WMT18 English-German (En-De) without ParaCrawl, consisting of 4.5 million sentence pairs; 2) WMT14 English-French (En-Fr), comprising 36 million sentence pairs; and 3) WMT18 English-Czech (En-Cs) without ParaCrawl, with 15.8 million sentence pairs. The test data for these language pairs is from the WMT-Chat and WMT-Biomedical datasets, both considered OOD for WMT14 and WMT18. Detailed descriptions can be found in Appendix D. |
| Dataset Splits | Yes | Each of the three datasets, including En-Ig (CCMatrix), En-Ha (CCMatrix), and En-De (WMT), is divided into training, validation, and test splits. Models are trained on the training split, evaluated on the test split (results shown in this table), and the validation split is used for tuning hyperparameters. |
| Hardware Specification | Yes | All the experiments were conducted on an Nvidia A100 GPU. |
| Software Dependencies | No | The Adam (Kingma & Ba, 2014) optimizer with β1 = 0.9, β2 = 0.98, and ϵ = 10⁻⁹ is used. Tokenization is performed using Moses, a rule-based tokenizer. |
| Experiment Setup | Yes | Search configuration. We follow the settings in Zhao et al. (2021) for architecture search. Both vanilla DARTS and Ours-darts utilize two identical encoder and two identical decoder layers during the search phase. Each encoder layer consists of [Self-Attention → Search Node → Search Node], whereas each decoder layer consists of [Self-Attention → Cross-Attention → Search Node → Search Node]. Only the Search Node is searched, while the architectures of Self-Attention and Cross-Attention are fixed; Zhao et al. (2021) empirically show that architecture search with this configuration yields better results. The layers with searched architectures are stacked to construct the final model with six encoder layers and six decoder layers. PDARTS and Ours-pdarts adopt a progressive learning approach, where the number of encoder and decoder layers increases from 2 to 4 to 6 during the search process. Simultaneously, the size of the operation set in the encoder layer is reduced from 15 to 10 to 5, and in the decoder layer from 16 to 11 to 6. Hyperparameter settings. Following Vaswani et al. (2017), we utilize 6 encoder and decoder layers, a hidden size of 512, a filter size of 2048, and 8 attention heads for the vanilla Transformer, DARTS, PDARTS, Ours-darts, and Ours-pdarts. For Ours-darts and Ours-pdarts, the radial basis function (RBF) kernel is used to compute the maximum mean discrepancy (MMD) used in Stage II. The tradeoff parameter λ is set to 1.5. For the optimization of W in our methods and baselines, the same hyperparameter settings as Vaswani et al. (2017) are used, including the learning rate and its scheduler, with warm-up steps set to 4000. The Adam (Kingma & Ba, 2014) optimizer with β1 = 0.9, β2 = 0.98, and ϵ = 10⁻⁹ is used. We increase the learning rate linearly for the first warm-up training steps, and decrease it thereafter proportionally to the inverse square root of the step number. For the optimization of architecture weights A, both our methods and the baselines use the same hyperparameters following Liu et al. (2018), employing a constant learning rate of 3 × 10⁻⁴ and a weight decay of 10⁻³, with the Adam optimizer (β1 = 0.9, β2 = 0.98, ϵ = 10⁻⁹). The perturbation δ in our framework is optimized using an Adam optimizer with β1 = 0.9, β2 = 0.98, ϵ = 10⁻⁹ and a constant learning rate of 10⁻³. We set the dropout rate to 0.1 and employ label smoothing with a value of 0.1 during architecture search. The batch size is set to 4096 and the maximum sentence length to 256 for all experiments. Tokenization is performed using Moses, a rule-based tokenizer. BLEU (Papineni et al., 2002) is used as the evaluation metric. We employ beam search during inference with a beam size of 4 and a length penalty α = 0.6. |
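The alternating loop in Algorithm 1 can be sketched in plain Python. This is a minimal, illustrative stand-in: the paper's actual update rules are its equations (7)–(9), which the report does not reproduce, so the toy quadratic losses and the `optimize` helper below are hypothetical placeholders that only mirror the three-step structure (weights W, adversarial perturbation δ, architecture A).

```python
import numpy as np

def optimize(steps=100, lr=0.1):
    """Illustrative alternating optimization in the shape of Algorithm 1.

    The gradients below come from placeholder quadratic losses, NOT the
    paper's equations (7)-(9); only the loop structure is faithful.
    """
    rng = np.random.default_rng(0)
    W = rng.normal(size=3)      # model weights
    A = rng.normal(size=3)      # relaxed architecture parameters
    delta = np.zeros(3)         # perturbation applied to A

    for _ in range(steps):
        # 1. Update weights W on a training loss (stand-in for eq. 7).
        grad_W = 2.0 * (W - A)
        W -= lr * grad_W
        # 2. Ascend the perturbation delta (stand-in for eq. 8),
        #    keeping it inside a small ball.
        grad_d = 2.0 * (A + delta)
        delta = np.clip(delta + lr * grad_d, -0.05, 0.05)
        # 3. Update architecture A on a validation loss under the
        #    perturbed parameters (stand-in for eq. 9).
        grad_A = 2.0 * (A + delta - W)
        A -= lr * grad_A

    # Derive a discrete architecture from the learned A, e.g. by
    # keeping the strongest candidate operation.
    return W, A, int(np.argmax(A))
```

The key design point mirrored here is the ordering: weights, then the worst-case perturbation, then the architecture parameters, repeated until convergence, with the final architecture derived from A only after the loop ends.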
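The learning-rate schedule described in the setup (linear warm-up for 4000 steps, then decay proportional to the inverse square root of the step) is the standard schedule from Vaswani et al. (2017). A small sketch, using the paper's hidden size of 512 and a hypothetical `transformer_lr` helper name:

```python
def transformer_lr(step, d_model=512, warmup=4000):
    """Vaswani et al. (2017) schedule: lr rises linearly for `warmup`
    steps, peaks at step == warmup, then decays as step ** -0.5."""
    step = max(step, 1)  # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

At `step == warmup` the two branches of the `min` coincide, so the schedule is continuous at its peak.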
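Stage II computes a maximum mean discrepancy (MMD) with an RBF kernel. The exact estimator and kernel bandwidth are not specified in the excerpt above, so the following is a hedged NumPy sketch of a biased squared-MMD estimate with a fixed bandwidth `gamma`, not the authors' implementation:

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Pairwise squared Euclidean distances, then exp(-gamma * ||x - y||^2).
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(X, Y, gamma=1.0):
    """Biased estimate of squared MMD between samples X and Y
    under an RBF kernel with bandwidth parameter gamma."""
    return (rbf_kernel(X, X, gamma).mean()
            + rbf_kernel(Y, Y, gamma).mean()
            - 2.0 * rbf_kernel(X, Y, gamma).mean())
```

The estimate is zero when both samples are identical and grows as the two empirical distributions separate, which is the property the framework relies on when comparing feature distributions.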