Stacking Diverse Architectures to Improve Machine Translation

Authors: Andrea Schioppa, Nal Kalchbrenner

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On a large suite of machine translation tasks, we find that Lasagna not only matches or outperforms the Transformer baseline, but it does so more efficiently thanks to widespread use of the efficient convolutional blocks. These findings suggest that the widespread use of uniform architectures may be suboptimal in certain scenarios and that exploiting the diversity of inductive architectural biases can lead to substantial gains.
Researcher Affiliation | Industry | Andrea Schioppa (EMAIL), Google Research; Nal Kalchbrenner (EMAIL), Google Research
Pseudocode | No | The paper describes the operations mathematically (Section 3.1) and with diagrams (e.g., Figure 1b) for the Gated Light Convolution layer, but does not provide explicit pseudocode or algorithm blocks with structured steps.
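Since the paper gives no pseudocode, the following is an illustrative NumPy sketch of one plausible reading of a gated light convolution: a GLU-style gate followed by a softmax-normalized, head-shared lightweight convolution as in Wu et al. (2019). The function and parameter names, and the symmetric (non-causal) padding, are assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_light_conv(x, conv_w, gate_w, gate_b, num_heads):
    """Sketch of a gated lightweight convolution.

    x:      (seq_len, channels) input sequence
    conv_w: (num_heads, kernel_size) kernels, softmax-normalized and
            shared by all channels within a head (Wu et al., 2019)
    gate_w, gate_b: GLU gate parameters projecting channels -> 2*channels
    """
    T, C = x.shape
    # GLU gate: split a linear projection into value and sigmoid-gate halves.
    proj = x @ gate_w + gate_b                 # (T, 2C)
    value, gate = proj[:, :C], proj[:, C:]
    x = value * (1.0 / (1.0 + np.exp(-gate)))
    # Lightweight convolution with softmax-normalized, head-shared kernels.
    H, K = conv_w.shape
    w = softmax(conv_w, axis=-1)
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))       # symmetric padding (assumed)
    out = np.zeros_like(x)
    ch_per_head = C // H
    for h in range(H):
        cols = slice(h * ch_per_head, (h + 1) * ch_per_head)
        for t in range(T):
            out[t, cols] = (w[h][:, None] * xp[t:t + K, cols]).sum(axis=0)
    return out
```

A decoder-side variant would use causal (left-only) padding; the loops are written for clarity rather than speed.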
Open Source Code | Yes | Our code is available on GitHub at google-research/lasagna_mt.
Open Datasets | Yes | As standard benchmarks we consider WMT14 English to German (En-De) and English to French (En-Fr) as in (Vaswani et al., 2017). We prepare the data using the scripts in the Fairseq translation examples. To test more distant language pairs that could benefit from a more global inductive bias, as captured by SA, we consider the WMT17 Chinese to English (Zh-En) benchmark, following the preprocessing in (Wu et al., 2019; Hassan et al., 2018) (about 20M pairs), and the WMT17 English to Turkish (En-Tr) benchmark, following the preprocessing in (Zhang et al., 2018) (about 0.2M pairs).
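The paper defers to the Fairseq translation examples for exact data preparation. As an illustrative sketch only, binarizing a BPE-encoded WMT14 En-De corpus with Fairseq's standard CLI might look as follows (all paths, and the choice of a joined dictionary, are assumptions, not taken from the paper):

```
# Illustrative only: binarize BPE-encoded splits for fairseq training.
fairseq-preprocess \
  --source-lang en --target-lang de \
  --trainpref data/train.bpe \
  --validpref data/valid.bpe \
  --testpref data/test.bpe \
  --joined-dictionary \
  --destdir data-bin/wmt14_en_de \
  --workers 8
```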
Dataset Splits | No | The paper mentions using standard benchmarks such as WMT14 and WMT17 and evaluating on newstest17, implying standard train/validation/test sets, and it states that checkpoints are selected on validation data. However, it does not explicitly provide percentages, sample counts, or direct citations for the train/validation/test splits in the main text.
Hardware Specification | Yes | We train and evaluate autoregressive Base models on V100 GPUs, whereas we use A100 GPUs for models of size Big.
Software Dependencies | No | The paper mentions using Fairseq (Ott et al., 2019) and PyTorch, but does not specify their version numbers or other key software dependencies with pinned versions.
Experiment Setup | Yes | A complete hyper-parameter setup can be found in Appendix C. For estimating inference speed we always decode 128 sentences in a single batch [...] We train models at two scales, Base and Big, as defined in terms of the number of heads and channel dimensions in (Vaswani et al., 2017). ... Appendix C: Training Hyper-parameters. We preprocess the data with Fairseq (Ott et al., 2019) using BPE vocabularies with splits and training hyper-parameters reported in Table 14. Note that we sometimes simulate training on a larger number of GPUs by using gradient accumulation steps, which in Fairseq are specified with the update-freq parameter. There are three dropouts: one for the MLPs, one for attention and one for the convolutional weights. For (En-De) we use, respectively, 0.3, 0.1 and 0.1; for (En-Fr) 0.1, 0.1, 0.1; for (Zh-En) 0.2, 0.2, 0.2; for (En-Tr) 0.3, 0.1, 0.1. For the learning rate warmup we use a linear schedule to the peak rate with 4k steps at model size Base and for (En-Fr) at model size Big, and 10k steps at model size Big for (En-De) and (Zh-En) and for NAT. The embedding dimensions and the number of heads are the standard ones for models Base and Big: for Base we use 512 embedding dimensions and 8 heads, for Big 1024 embedding dimensions and 16 heads. The kernel sizes for Light and Dynamic Convolutions are as in (Wu et al., 2019).
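The quoted setup specifies linear warmup to a peak learning rate (4k or 10k steps) but not the post-warmup decay. A minimal sketch of the schedule, assuming the inverse-square-root decay of Vaswani et al. (2017) (Fairseq's `inverse_sqrt` scheduler) after warmup:

```python
def learning_rate(step, peak_lr, warmup_steps):
    """Linear warmup to peak_lr, then inverse-sqrt decay (assumed)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps      # linear warmup, as stated
    return peak_lr * (warmup_steps / step) ** 0.5  # decay is an assumption

# Base models use 4k warmup steps:
lr_mid = learning_rate(2000, 1e-3, 4000)   # halfway through warmup: 5e-4
```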