Multilingual Machine Translation: Deep Analysis of Language-Specific Encoder-Decoders
Authors: Carlos Escolano, Marta R. Costa-jussà, José A. R. Fonollosa
JAIR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our proposal on three experimental configurations: translation for the jointly trained initial languages, translation when incrementally training a new language, and zero-shot translation (Section 4). Our results show that the proposed method is competitive in the first two configurations, but still lags behind the shared encoder-decoder in zero-shot translation. Section 4 overviews experiments in machine translation where we compare our proposed method to the shared encoder-decoder, which is considered the state of the art in current multilingual approaches. Section 5 provides an in-depth analysis of the intermediate representations created with our proposed method by showing the results in natural language inference and visualizing some intermediate sentence representations. |
| Researcher Affiliation | Academia | Carlos Escolano EMAIL Marta R. Costa-jussà EMAIL José A. R. Fonollosa EMAIL TALP Research Center, Universitat Politècnica de Catalunya, Barcelona |
| Pseudocode | Yes | Algorithm 1 Multilingual training step<br>1: procedure MultilingualTrainingStep<br>2: N ← number of languages in the system<br>3: S = {s_{0,0}, ..., s_{N,N}} ⊆ {(e_i, d_j)}<br>4: E = {e_0, ..., e_N} language-specific encoders<br>5: D = {d_0, ..., d_N} language-specific decoders<br>6: for i ← 0 to N do<br>7: for j ← 0 to N do<br>8: if s_{i,j} ∈ S then<br>9: l_i, l_j = get_parallel_batch(i, j)<br>10: train(s_{i,j}(e_i, d_j), l_i, l_j) |
| Open Source Code | No | The paper mentions using Fairseq for Transformer implementation (Release v0.6.0 available at https://github.com/pytorch/fairseq) and a visualization tool from prior work (https://github.com/elorala/interlingua-visualization). However, it does not explicitly state that the source code for the proposed language-specific encoder-decoder methodology described in this paper is openly available. |
| Open Datasets | Yes | We used 2 million sentences from the Europarl corpus (Koehn, 2005) in German, French, Spanish and English as training data... For Russian-English, we used 1 million training sentences from the Yandex corpus (https://translate.yandex.ru/corpus?lang=en). As validation and test set, we used newstest2012 and newstest2013 from WMT (http://www.statmt.org). For this task, we use the MultiNLI corpus (https://cims.nyu.edu/~sbowman/multinli/) for training, which contains approximately 430k entries. We use the XNLI validation and test set (Conneau et al., 2018) for cross-lingual results. |
| Dataset Splits | Yes | We used 2 million sentences from the Europarl corpus... as training data... As validation and test set, we used newstest2012 and newstest2013 from WMT. For this task, we use the MultiNLI corpus for training... We use the XNLI validation and test set (Conneau et al., 2018) for cross-lingual results, which contain 2.5k and 5k segments, respectively, for each language. |
| Hardware Specification | Yes | All experiments were performed on an NVIDIA Titan X GPU with 12 GB of memory. |
| Software Dependencies | Yes | All the experiments were done using the Transformer implementation provided by Fairseq (Release v0.6.0, available at https://github.com/pytorch/fairseq). |
| Experiment Setup | Yes | We used 6 layers, each with 8 attention heads, an embedding size of 512 dimensions, a feedforward hidden size of 2048, and a vocabulary of 32k subword tokens with Byte Pair Encoding (Sennrich, Haddow, & Birch, 2016)... The dropout was 0.1 for the shared approach and 0.3 for language-specific encoders/decoders. Both approaches were trained with an effective batch size of 32k tokens for approximately 200k updates, using the validation loss for early stopping. In all cases, we used Adam (Kingma & Ba, 2015) as the optimizer, with a learning rate of 0.001 and 4000 warmup steps. |
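The pseudocode quoted in the Pseudocode row above (Algorithm 1, the multilingual training step that pairs language-specific encoders and decoders) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: `get_parallel_batch` and `train` stand in for the paper's (unspecified) data-loading and optimization routines, and the encoder/decoder objects are opaque placeholders.

```python
def multilingual_training_step(num_langs, trained_pairs, encoders, decoders,
                               get_parallel_batch, train):
    """One multilingual training step, following Algorithm 1.

    num_langs:      number of languages N in the system
    trained_pairs:  set S of (i, j) index pairs that have parallel data
    encoders:       list E of language-specific encoders, one per language
    decoders:       list D of language-specific decoders, one per language
    get_parallel_batch(i, j): returns a parallel batch (l_i, l_j)
    train(enc, dec, l_i, l_j): updates only encoder i and decoder j
    """
    for i in range(num_langs):
        for j in range(num_langs):
            if (i, j) in trained_pairs:
                # Fetch a parallel batch for source language i, target language j
                l_i, l_j = get_parallel_batch(i, j)
                # Train the system s_{i,j} composed of encoder e_i and decoder d_j
                train(encoders[i], decoders[j], l_i, l_j)
```

Note that only the modules for the sampled language pair receive gradient updates in each inner step, which is what allows a new language to be added later by training only its own encoder and decoder.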