The Impact of Syntactic and Semantic Proximity on Machine Translation with Back-Translation
Authors: Nicolas Guerin, Emmanuel Chemla, Shane Steinert-Threlkeld
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We therefore conduct controlled experiments with artificial languages to determine what properties of languages make back-translation an effective training method, covering lexical, syntactic, and semantic properties. We find, contrary to popular belief, that (i) parallel word frequency distributions, (ii) partially shared vocabulary, and (iii) similar syntactic structure across languages are not sufficient to explain the success of back-translation. We show however that even crude semantic signal (similar lexical fields across languages) does improve alignment of two languages through back-translation. |
| Researcher Affiliation | Academia | Nicolas Guerin, Laboratoire de Sciences Cognitives et Psycholinguistique, École Normale Supérieure, PSL University; Emmanuel Chemla, Laboratoire de Sciences Cognitives et Psycholinguistique, École Normale Supérieure, PSL University; Shane Steinert-Threlkeld, Department of Linguistics, University of Washington |
| Pseudocode | No | The paper describes the model components and objectives mathematically and visually in Figure 1 and Figure 2, but it does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the methodology described, nor does it provide a direct link to a code repository. It mentions using 'the exact model introduced by Lample et al. (2018c)' but this refers to a third-party model, not their own implementation's code. |
| Open Datasets | No | For training purposes, we generated two sets of 100,000 sentence structures. All unsupervised training sets are made of (i) one of these sets of sentence structures for the 100,000 sentences in one language (using the language-specific grammar and lexicon), and (ii) the other set to create 100,000 sentences in the other language. Hence, the data are neither labeled for supervision, nor are they even unlabelled translations of one another in principle. The test and validation sets are each composed of 10,000 parallel sentence pairs. These are generated in one language, and then transformed into the second language using the known grammar switches and lexical translations. The paper describes the generation of artificial languages and datasets for the experiments but does not provide specific access information (link, DOI, etc.) for these generated datasets. |
| Dataset Splits | Yes | For training purposes, we generated two sets of 100,000 sentence structures. All unsupervised training sets are made of (i) one of these sets of sentence structures for the 100,000 sentences in one language (using the language-specific grammar and lexicon), and (ii) the other set to create 100,000 sentences in the other language. Hence, the data are neither labeled for supervision, nor are they even unlabelled translations of one another in principle. The test and validation sets are each composed of 10,000 parallel sentence pairs. These are generated in one language, and then transformed into the second language using the known grammar switches and lexical translations. |
| Hardware Specification | Yes | We trained on a Tesla M10 with 8GB memory for roughly one day. |
| Software Dependencies | No | The paper mentions using 'the exact model introduced by Lample et al. (2018c) for NMT', 'Transformer encoder layers and 4 Transformer decoder layers', and the 'FastText algorithm'. However, it does not specify version numbers for any of these software components or libraries. |
| Experiment Setup | Yes | We then train the translation model for 40 epochs, with a batch size of 16 and an Adam optimizer with a 10⁻⁴ learning rate. Those hyper-parameters were chosen based on Lample et al. (2018c) and on our computational capabilities. We trained on a Tesla M10 with 8GB memory for roughly one day. |
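The data-generation scheme quoted in the Dataset Splits row can be made concrete with a minimal sketch: abstract "sentence structures" are sampled, disjoint structure sets are realized in each language for the non-parallel training corpora, and a shared structure set is realized in both languages (via a known grammar switch) for the parallel evaluation pairs. The toy grammar, word lists, and function names below are hypothetical illustrations, not the paper's actual generator, and the corpus sizes are scaled down from the paper's 100,000/10,000.

```python
import random

random.seed(0)

# Hypothetical toy lexicon; the paper's artificial languages are richer.
SUBJECTS = ["dog", "cat", "bird"]
VERBS = ["sees", "likes", "chases"]
OBJECTS = ["ball", "fish", "tree"]

def sample_structure():
    """A 'sentence structure' as an abstract (subject, verb, object) triple."""
    return (random.choice(SUBJECTS), random.choice(VERBS), random.choice(OBJECTS))

def realize_l1(s):
    # Language 1: SVO word order.
    subj, verb, obj = s
    return f"{subj} {verb} {obj}"

def realize_l2(s):
    # Language 2: same structure under a known grammar "switch" (here, SOV order).
    subj, verb, obj = s
    return f"{subj} {obj} {verb}"

N_TRAIN, N_EVAL = 100, 10  # scaled down from 100,000 and 10,000

# Unsupervised training corpora: two *disjoint* sets of structures, one per
# language, so the monolingual corpora are not translations of each other.
train_l1 = [realize_l1(sample_structure()) for _ in range(N_TRAIN)]
train_l2 = [realize_l2(sample_structure()) for _ in range(N_TRAIN)]

# Validation/test: parallel pairs, generated in L1 and mapped into L2
# via the known grammar switch.
eval_pairs = []
for _ in range(N_EVAL):
    s = sample_structure()
    eval_pairs.append((realize_l1(s), realize_l2(s)))
```

The key property this sketch preserves is the one the evidence quote stresses: the training sides are not parallel even in principle, while the evaluation sets are exactly parallel by construction.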