Stacking Diverse Architectures to Improve Machine Translation

Authors: Andrea Schioppa, Nal Kalchbrenner

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On a large suite of machine translation tasks, we find that Lasagna not only matches or outperforms the Transformer baseline, but it does so more efficiently thanks to widespread use of the efficient convolutional blocks. These findings suggest that the widespread use of uniform architectures may be suboptimal in certain scenarios and that exploiting the diversity of inductive architectural biases can lead to substantial gains.
Researcher Affiliation | Industry | Andrea Schioppa (EMAIL), Google Research; Nal Kalchbrenner (EMAIL), Google Research
Pseudocode | No | The paper describes the operations mathematically (Section 3.1) and with diagrams (e.g., Figure 1b) for the Gated Light Convolution layer, but does not provide explicit pseudocode or algorithm blocks with structured steps.
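Since the paper gives no pseudocode, the following is an illustrative NumPy sketch of one plausible reading of a gated light convolution: a GLU-style gate followed by a softmax-normalized, head-shared lightweight convolution as in Wu et al. (2019). The function and parameter names, and the symmetric (non-causal) padding, are assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_light_conv(x, conv_w, gate_w, gate_b, num_heads):
    """Sketch of a gated lightweight convolution.

    x:      (seq_len, channels) input sequence
    conv_w: (num_heads, kernel_size) kernels, softmax-normalized and
            shared by all channels within a head (Wu et al., 2019)
    gate_w, gate_b: GLU gate parameters projecting channels -> 2*channels
    """
    T, C = x.shape
    # GLU gate: split a linear projection into value and sigmoid-gate halves.
    proj = x @ gate_w + gate_b                 # (T, 2C)
    value, gate = proj[:, :C], proj[:, C:]
    x = value * (1.0 / (1.0 + np.exp(-gate)))
    # Lightweight convolution with softmax-normalized, head-shared kernels.
    H, K = conv_w.shape
    w = softmax(conv_w, axis=-1)
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))       # symmetric padding (assumed)
    out = np.zeros_like(x)
    ch_per_head = C // H
    for h in range(H):
        cols = slice(h * ch_per_head, (h + 1) * ch_per_head)
        for t in range(T):
            out[t, cols] = (w[h][:, None] * xp[t:t + K, cols]).sum(axis=0)
    return out
```

A decoder-side variant would use causal (left-only) padding; the loops are written for clarity rather than speed.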
Open Source Code | Yes | Our code is available on GitHub at google-research/lasagna_mt.
Open Datasets | Yes | As standard benchmarks we consider WMT14 English to German (En-De) and English to French (En-Fr) as in (Vaswani et al., 2017). We prepare the data using the scripts in the Fairseq translation examples. To test more distant language pairs that could benefit from a more global inductive bias, as captured by SA, we consider the WMT17 Chinese to English (Zh-En) benchmark, following the preprocessing in (Wu et al., 2019; Hassan et al., 2018) (about 20M pairs), and the WMT17 English to Turkish (En-Tr) benchmark, following the preprocessing in (Zhang et al., 2018) (about 0.2M pairs).
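The paper defers to the Fairseq translation examples for exact data preparation. As an illustrative sketch only, binarizing a BPE-encoded WMT14 En-De corpus with Fairseq's standard CLI might look as follows (all paths, and the choice of a joined dictionary, are assumptions, not taken from the paper):

```
# Illustrative only: binarize BPE-encoded splits for fairseq training.
fairseq-preprocess \
  --source-lang en --target-lang de \
  --trainpref data/train.bpe \
  --validpref data/valid.bpe \
  --testpref data/test.bpe \
  --joined-dictionary \
  --destdir data-bin/wmt14_en_de \
  --workers 8
```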
Dataset Splits | No | The paper mentions using standard benchmarks such as WMT14 and WMT17 and evaluating on newstest17, implying standard train/validation/test sets, and it states that checkpoints are selected on validation data. However, it does not explicitly provide percentages, sample counts, or direct citations for the train/validation/test splits in the main text.
Hardware Specification | Yes | We train and evaluate autoregressive Base models on V100 GPUs, whereas we use A100 GPUs for models of size Big.
Software Dependencies | No | The paper mentions using Fairseq (Ott et al., 2019) and PyTorch, but does not specify their version numbers or other key software dependencies with pinned versions.
Experiment Setup | Yes | A complete hyper-parameter setup can be found in Appendix C. For estimating inference speed we always decode 128 sentences in a single batch [...] We train models at two scales, Base and Big, as defined in terms of the number of heads and channel dimensions in (Vaswani et al., 2017). ... Appendix C: Training Hyper-parameters. We preprocess the data with Fairseq (Ott et al., 2019) using BPE vocabularies with splits and training hyper-parameters reported in Table 14. Note that we sometimes simulate training on a larger number of GPUs by using gradient accumulation steps, which in Fairseq are specified with the update-freq parameter. There are three dropouts: one for the MLPs, one for attention and one for the convolutional weights. For (En-De) we use, respectively, 0.3, 0.1 and 0.1; for (En-Fr) 0.1, 0.1, 0.1; for (Zh-En) 0.2, 0.2, 0.2; for (En-Tr) 0.3, 0.1, 0.1. For the learning rate warmup we use a linear schedule to the peak rate with 4k steps at model size Base and for (En-Fr) at model size Big, and 10k steps at model size Big for (En-De) and (Zh-En) and for NAT. The embedding dimensions and the number of heads are the standard ones for models Base and Big: for Base we use 512 embedding dimensions and 8 heads, for Big 1024 embedding dimensions and 16 heads. The kernel sizes for Light and Dynamic Convolutions are as in (Wu et al., 2019).
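The quoted setup specifies linear warmup to a peak learning rate (4k or 10k steps) but not the post-warmup decay. A minimal sketch of the schedule, assuming the inverse-square-root decay of Vaswani et al. (2017) (Fairseq's `inverse_sqrt` scheduler) after warmup:

```python
def learning_rate(step, peak_lr, warmup_steps):
    """Linear warmup to peak_lr, then inverse-sqrt decay (assumed)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps      # linear warmup, as stated
    return peak_lr * (warmup_steps / step) ** 0.5  # decay is an assumption

# Base models use 4k warmup steps:
lr_mid = learning_rate(2000, 1e-3, 4000)   # halfway through warmup: 5e-4
```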