Multilingual Machine Translation: Deep Analysis of Language-Specific Encoder-Decoders
Authors: Carlos Escolano, Marta R. Costa-jussà, José A. R. Fonollosa
JAIR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our proposal on three experimental configurations: translation for the jointly trained initial languages, translation when incrementally training a new language, and zero-shot translation (Section 4). Our results show that the proposed method is competitive in the first two configurations, but still lags behind the shared encoder-decoder in zero-shot translation. Section 4 overviews experiments in machine translation where we compare our proposed method to the shared encoder-decoder, which is considered the state of the art in current multilingual approaches. Section 5 provides an in-depth analysis of the intermediate representations created with our proposed method by showing the results in natural language inference and visualizing some intermediate sentence representations. |
| Researcher Affiliation | Academia | Carlos Escolano EMAIL Marta R. Costa-jussà EMAIL José A. R. Fonollosa EMAIL TALP Research Center, Universitat Politècnica de Catalunya, Barcelona |
| Pseudocode | Yes | Algorithm 1 Multilingual training step<br>1: procedure MultilingualTrainingStep<br>2: N ← number of languages in the system<br>3: S = {s_{0,0}, ..., s_{N,N}} ⊆ {(e_i, d_j)}<br>4: E = {e_0, ..., e_N} language-specific encoders<br>5: D = {d_0, ..., d_N} language-specific decoders<br>6: for i ← 0 to N do<br>7: for j ← 0 to N do<br>8: if s_{i,j} ∈ S then<br>9: l_i, l_j = get_parallel_batch(i, j)<br>10: train(s_{i,j}(e_i, d_j), l_i, l_j) |
| Open Source Code | No | The paper mentions using Fairseq for Transformer implementation (Release v0.6.0 available at https://github.com/pytorch/fairseq) and a visualization tool from prior work (https://github.com/elorala/interlingua-visualization). However, it does not explicitly state that the source code for the proposed language-specific encoder-decoder methodology described in this paper is openly available. |
| Open Datasets | Yes | We used 2 million sentences from the Europarl corpus (Koehn, 2005) in German, French, Spanish and English as training data... For Russian-English, we used 1 million training sentences from the Yandex corpus (https://translate.yandex.ru/corpus?lang=en). As validation and test set, we used newstest2012 and newstest2013 from WMT (http://www.statmt.org). For this task, we use the MultiNLI corpus (https://cims.nyu.edu/~sbowman/multinli/) for training, which contains approximately 430k entries. We use the XNLI validation and test set (Conneau et al., 2018) for cross-lingual results. |
| Dataset Splits | Yes | We used 2 million sentences from the Europarl corpus... as training data... As validation and test set, we used newstest2012 and newstest2013 from WMT. For this task, we use the MultiNLI corpus for training... We use the XNLI validation and test set (Conneau et al., 2018) for cross-lingual results, which contain 2.5k and 5k segments, respectively, for each language. |
| Hardware Specification | Yes | All experiments were performed on an NVIDIA Titan X GPU with 12 GB of memory. |
| Software Dependencies | Yes | All the experiments were done using the Transformer implementation provided by Fairseq (Release v0.6.0, available at https://github.com/pytorch/fairseq). |
| Experiment Setup | Yes | We used 6 layers, each with 8 attention heads, an embedding size of 512 dimensions, a feedforward hidden size of 2048, and a vocabulary of 32k subword tokens with Byte Pair Encoding (Sennrich, Haddow, & Birch, 2016)... The dropout was 0.1 for the shared approach and 0.3 for language-specific encoders/decoders. Both approaches were trained with an effective batch size of 32k tokens for approximately 200k updates, using the validation loss for early stopping. In all cases, we used Adam (Kingma & Ba, 2015) as the optimizer, with a learning rate of 0.001 and 4000 warmup steps. |
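The pseudocode quoted in the Pseudocode row above (Algorithm 1, the multilingual training step that pairs language-specific encoders and decoders) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: `get_parallel_batch` and `train` stand in for the paper's (unspecified) data-loading and optimization routines, and the encoder/decoder objects are opaque placeholders.

```python
def multilingual_training_step(num_langs, trained_pairs, encoders, decoders,
                               get_parallel_batch, train):
    """One multilingual training step, following Algorithm 1.

    num_langs:      number of languages N in the system
    trained_pairs:  set S of (i, j) index pairs that have parallel data
    encoders:       list E of language-specific encoders, one per language
    decoders:       list D of language-specific decoders, one per language
    get_parallel_batch(i, j): returns a parallel batch (l_i, l_j)
    train(enc, dec, l_i, l_j): updates only encoder i and decoder j
    """
    for i in range(num_langs):
        for j in range(num_langs):
            if (i, j) in trained_pairs:
                # Fetch a parallel batch for source language i, target language j
                l_i, l_j = get_parallel_batch(i, j)
                # Train the system s_{i,j} composed of encoder e_i and decoder d_j
                train(encoders[i], decoders[j], l_i, l_j)
```

Note that only the modules for the sampled language pair receive gradient updates in each inner step, which is what allows a new language to be added later by training only its own encoder and decoder.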