Supervised Visual Attention for Simultaneous Multimodal Machine Translation

Authors: Veneta Haralampieva, Ozan Caglayan, Lucia Specia

JAIR 2022

Reproducibility Variable Result LLM Response
Research Type Experimental We perform comprehensive experiments on three language directions and conduct thorough quantitative and qualitative analyses using both automatic metrics and manual inspection. Our results show that (i) supervised visual attention consistently improves the translation quality of the simultaneous MMT models, and (ii) fine-tuning the MMT with supervision loss enabled leads to better performance than training the MMT from scratch. Compared to the state-of-the-art, our proposed model achieves improvements of up to 2.3 BLEU and 3.5 METEOR points.
Researcher Affiliation Academia Veneta Haralampieva EMAIL Ozan Caglayan (Corresponding author) EMAIL Lucia Specia EMAIL Department of Computing, Imperial College London, UK
Pseudocode Yes Algorithm 1: Prefix training (Niehues et al., 2018; Arivazhagan et al., 2020)
Open Source Code No The paper does not provide an explicit statement of code release for their methodology or a link to their own code repository. It only references a third-party model's Docker Hub link.
Open Datasets Yes We use the Multi30k dataset (Elliott et al., 2016), which is a multi-lingual extension to the Flickr30k image captioning dataset (Young et al., 2014). Footnote 4: https://github.com/multi30k/dataset
Dataset Splits Yes Table 1: Multi30k dataset statistics: Words denote the total number of words in a split whereas Len is the average number of words per sentence in that split. Per language column, figures are given as Words/Len, followed by the number of sentences in the split:
train: 380K/13.1, 364K/12.6, 416K/14.4, 298K/10.3; 29,000 sentences
val: 13.4K/13.2, 13.1K/12.9, 14.6K/14.4, 10.4K/10.2; 1,014 sentences
test2016: 13.0K/13.1, 12.2K/12.2, 14.2K/14.2, 10.5K/10.5; 1,000 sentences
test2017: 11.4K/11.4, 10.9K/10.9, 12.8K/12.8; 1,000 sentences
test COCO: 5.2K/11.4, 5.2K/11.2, 5.8K/12.5; 461 sentences
Hardware Specification No The paper does not explicitly describe the specific hardware (e.g., GPU model, CPU type) used for running its experiments. It mentions the ResNet-101 backend for the BUTD detector but not the experimental hardware.
Software Dependencies No The paper mentions using "Moses tools" and "NLTK toolkit" but does not specify their version numbers. It also refers to "Adam" optimizer and "noam scheduler" but these are algorithms, not software dependencies with specific versions.
Experiment Setup Yes We use the Base Transformer (Vaswani et al., 2017) configuration in all our experiments, where both the encoder and decoder have 6 layers (B = 6 in Figure 2), each attention layer has 8 heads, the model dimension is 512 and the feed forward layer size is 2048. Additionally, we share the parameters of the target and output language embedding matrix (Press & Wolf, 2017). ... During training, we optimise the models using Adam (Kingma & Ba, 2014) and decay the learning rate with the noam scheduler (Vaswani et al., 2017). The initial learning rate, β1 and β2 are 0.2, 0.9 and 0.98, respectively. The learning rate is warmed up for 4,000 steps. We use a batch size of 32, apply label smoothing with ϵ = 0.1 (Szegedy et al., 2016) and clip the gradients so that their norm is 1 (Pascanu et al., 2014). We train each system 3 times with different random seeds for a maximum of 100 epochs, with early stopping based on the validation METEOR (Denkowski & Lavie, 2014) score, which is the official metric used in all shared tasks in MMT (Barrault et al., 2018). The best checkpoint with respect to validation METEOR is selected to decode test set translations using the greedy search algorithm. ... For these particular variants, we disable the learning rate scheduling and lower the learning rate to 1e-5.
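The experiment setup above combines Adam with the "noam" schedule (linear warm-up for 4,000 steps, then inverse-square-root decay). A minimal sketch of that schedule follows; treating the paper's stated initial learning rate of 0.2 as a constant multiplier on the standard formula is an assumption, not a detail confirmed by the excerpt.

```python
def noam_lr(step, model_dim=512, warmup=4000, scale=0.2):
    """Inverse-square-root ("noam") learning-rate schedule (Vaswani et al., 2017).

    Rises linearly for `warmup` steps, then decays as step**-0.5.
    `scale` stands in for the paper's initial learning rate of 0.2
    (an assumption about how that value enters the formula).
    """
    step = max(step, 1)  # guard against step 0
    return scale * model_dim ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# The schedule peaks exactly at the warm-up boundary (step 4,000).
for s in (1, 2000, 4000, 16000):
    print(s, noam_lr(s))
```

In a typical PyTorch setup this function would be passed to a `LambdaLR`-style scheduler alongside Adam with betas (0.9, 0.98), with gradients clipped to norm 1 and label smoothing ϵ = 0.1, matching the quoted configuration.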
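The pseudocode row above refers to Algorithm 1, prefix training (Niehues et al., 2018; Arivazhagan et al., 2020), which trains a simultaneous model on truncated source/target prefix pairs so it learns to translate incomplete inputs. The sketch below illustrates the general idea only; the function name and the proportional-truncation rule are assumptions, not the paper's exact Algorithm 1.

```python
def make_prefix_pairs(src_tokens, tgt_tokens):
    """Generate (source prefix, target prefix) training pairs.

    For each source prefix of length i, the target is truncated
    proportionally to i / len(src). This proportional rule is an
    assumed simplification of prefix training, not the paper's
    exact algorithm.
    """
    pairs = []
    n = len(src_tokens)
    for i in range(1, n + 1):
        # Keep at least one target token; scale length with the source prefix.
        j = max(1, round(len(tgt_tokens) * i / n))
        pairs.append((src_tokens[:i], tgt_tokens[:j]))
    return pairs

# The full sentence pair is always included as the final prefix pair.
print(make_prefix_pairs(["ein", "Mann", "läuft", "."], ["a", "man", "runs", "."]))
```

Augmenting the training data with such pairs lets a standard sequence-to-sequence model be reused for simultaneous decoding without an explicit read/write policy.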