Supervised Visual Attention for Simultaneous Multimodal Machine Translation

Authors: Veneta Haralampieva, Ozan Caglayan, Lucia Specia

JAIR 2022

Reproducibility Variable Result LLM Response
Research Type Experimental We perform comprehensive experiments on three language directions and conduct thorough quantitative and qualitative analyses using both automatic metrics and manual inspection. Our results show that (i) supervised visual attention consistently improves the translation quality of the simultaneous MMT models, and (ii) fine-tuning the MMT with supervision loss enabled leads to better performance than training the MMT from scratch. Compared to the state-of-the-art, our proposed model achieves improvements of up to 2.3 BLEU and 3.5 METEOR points.
Researcher Affiliation Academia Veneta Haralampieva EMAIL Ozan Caglayan (Corresponding author) EMAIL Lucia Specia EMAIL Department of Computing, Imperial College London, UK
Pseudocode Yes Algorithm 1: Prefix training (Niehues et al., 2018; Arivazhagan et al., 2020)
Open Source Code No The paper does not provide an explicit statement of code release for their methodology or a link to their own code repository. It only references a third-party model's Docker Hub link.
Open Datasets Yes We use the Multi30k dataset (Elliott et al., 2016), which is a multi-lingual extension to the Flickr30k image captioning dataset (Young et al., 2014). Footnote 4: https://github.com/multi30k/dataset
Dataset Splits Yes Table 1: Multi30k dataset statistics: Words denote the total number of words in a split whereas Len is the average number of words per sentence in that split. Per language column, figures are given as Words/Len, followed by the number of sentences in the split:
train: 380K/13.1, 364K/12.6, 416K/14.4, 298K/10.3; 29,000 sentences
val: 13.4K/13.2, 13.1K/12.9, 14.6K/14.4, 10.4K/10.2; 1,014 sentences
test2016: 13.0K/13.1, 12.2K/12.2, 14.2K/14.2, 10.5K/10.5; 1,000 sentences
test2017: 11.4K/11.4, 10.9K/10.9, 12.8K/12.8; 1,000 sentences
test COCO: 5.2K/11.4, 5.2K/11.2, 5.8K/12.5; 461 sentences
Hardware Specification No The paper does not explicitly describe the specific hardware (e.g., GPU model, CPU type) used for running its experiments. It mentions the ResNet-101 backend for the BUTD detector but not the experimental hardware.
Software Dependencies No The paper mentions using "Moses tools" and "NLTK toolkit" but does not specify their version numbers. It also refers to "Adam" optimizer and "noam scheduler" but these are algorithms, not software dependencies with specific versions.
Experiment Setup Yes We use the Base Transformer (Vaswani et al., 2017) configuration in all our experiments, where both the encoder and decoder have 6 layers (B = 6 in Figure 2), each attention layer has 8 heads, the model dimension is 512 and the feed forward layer size is 2048. Additionally, we share the parameters of the target and output language embedding matrix (Press & Wolf, 2017). ... During training, we optimise the models using Adam (Kingma & Ba, 2014) and decay the learning rate with the noam scheduler (Vaswani et al., 2017). The initial learning rate, β1 and β2 are 0.2, 0.9 and 0.98, respectively. The learning rate is warmed up for 4,000 steps. We use a batch size of 32, apply label smoothing with ϵ = 0.1 (Szegedy et al., 2016) and clip the gradients so that their norm is 1 (Pascanu et al., 2014). We train each system 3 times with different random seeds for a maximum of 100 epochs, with early stopping based on the validation METEOR (Denkowski & Lavie, 2014) score, which is the official metric used in all shared tasks in MMT (Barrault et al., 2018). The best checkpoint with respect to validation METEOR is selected to decode test set translations using the greedy search algorithm. ... For these particular variants, we disable the learning rate scheduling and lower the learning rate to 1e-5.
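The experiment setup above combines Adam with the "noam" schedule (linear warm-up for 4,000 steps, then inverse-square-root decay). A minimal sketch of that schedule follows; treating the paper's stated initial learning rate of 0.2 as a constant multiplier on the standard formula is an assumption, not a detail confirmed by the excerpt.

```python
def noam_lr(step, model_dim=512, warmup=4000, scale=0.2):
    """Inverse-square-root ("noam") learning-rate schedule (Vaswani et al., 2017).

    Rises linearly for `warmup` steps, then decays as step**-0.5.
    `scale` stands in for the paper's initial learning rate of 0.2
    (an assumption about how that value enters the formula).
    """
    step = max(step, 1)  # guard against step 0
    return scale * model_dim ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# The schedule peaks exactly at the warm-up boundary (step 4,000).
for s in (1, 2000, 4000, 16000):
    print(s, noam_lr(s))
```

In a typical PyTorch setup this function would be passed to a `LambdaLR`-style scheduler alongside Adam with betas (0.9, 0.98), with gradients clipped to norm 1 and label smoothing ϵ = 0.1, matching the quoted configuration.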
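The pseudocode row above refers to Algorithm 1, prefix training (Niehues et al., 2018; Arivazhagan et al., 2020), which trains a simultaneous model on truncated source/target prefix pairs so it learns to translate incomplete inputs. The sketch below illustrates the general idea only; the function name and the proportional-truncation rule are assumptions, not the paper's exact Algorithm 1.

```python
def make_prefix_pairs(src_tokens, tgt_tokens):
    """Generate (source prefix, target prefix) training pairs.

    For each source prefix of length i, the target is truncated
    proportionally to i / len(src). This proportional rule is an
    assumed simplification of prefix training, not the paper's
    exact algorithm.
    """
    pairs = []
    n = len(src_tokens)
    for i in range(1, n + 1):
        # Keep at least one target token; scale length with the source prefix.
        j = max(1, round(len(tgt_tokens) * i / n))
        pairs.append((src_tokens[:i], tgt_tokens[:j]))
    return pairs

# The full sentence pair is always included as the final prefix pair.
print(make_prefix_pairs(["ein", "Mann", "läuft", "."], ["a", "man", "runs", "."]))
```

Augmenting the training data with such pairs lets a standard sequence-to-sequence model be reused for simultaneous decoding without an explicit read/write policy.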