A Set of Recommendations for Assessing Human–Machine Parity in Language Translation

Authors: Samuel Läubli, Sheila Castilho, Graham Neubig, Rico Sennrich, Qinlan Shen, Antonio Toral

JAIR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our paper investigates three aspects of human MT evaluation, with a special focus on assessing human machine parity: the choice of raters, the use of linguistic context, and the creation of reference translations. We focus on the data shared by Hassan et al. (2018), and empirically test to what extent changes in the evaluation design affect the outcome of the human evaluation. Based on our empirical findings, we formulate a set of recommendations for human MT evaluation in general, and assessing human machine parity in particular. All of our data are made publicly available for external validation and further analysis.
Researcher Affiliation | Academia | Samuel Läubli EMAIL Institute of Computational Linguistics, University of Zurich; Sheila Castilho EMAIL ADAPT Centre, Dublin City University; Graham Neubig EMAIL Language Technologies Institute, Carnegie Mellon University; Rico Sennrich EMAIL Institute of Computational Linguistics, University of Zurich; Qinlan Shen EMAIL Language Technologies Institute, Carnegie Mellon University; Antonio Toral EMAIL Center for Language and Cognition, University of Groningen
Pseudocode | No | The paper describes evaluation protocols and methodologies in natural language text and flowcharts, but does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | All of our data are made publicly available for external validation and further analysis. https://github.com/ZurichNLP/mt-parity-assessment-data. The paper explicitly states that *data* is made publicly available, not the source code for the methodology.
Open Datasets | Yes | All of our data are made publicly available for external validation and further analysis. https://github.com/ZurichNLP/mt-parity-assessment-data. We use English translations of the Chinese source texts in the WMT 2017 English–Chinese test set (Bojar et al., 2017) for all experiments presented in this article.
Dataset Splits | Yes | We conduct a relative ranking experiment using one professional human (HA) and two machine translations (MT1 and MT2), considering the native Chinese part of the WMT 2017 Chinese–English test set (see Section 5.2 for details). The 299 sentences used in the experiments stem from 41 documents, randomly selected from all the documents in the test set originally written in Chinese, and are shown in their original order. [...] In each condition, four raters evaluate 50 documents (plus 5 spam items) and 104 sentences (plus 16 spam items). We use two non-overlapping sets of documents and two non-overlapping sets of sentences, and each is evaluated by two raters.
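The rater assignment quoted above (two non-overlapping item sets, each judged by two of four raters) can be sketched as follows. This is an illustrative reconstruction, not the paper's actual code; the document IDs, rater names, and `assign` helper are invented for the example.

```python
import random

def assign(documents, raters, seed=0):
    """Split documents into two non-overlapping sets and give each set
    to a pair of raters, mirroring the design described in the paper."""
    assert len(raters) == 4
    docs = list(documents)
    random.Random(seed).shuffle(docs)  # random, but reproducible via seed
    half = len(docs) // 2
    set_a, set_b = docs[:half], docs[half:]
    # Each non-overlapping document set is evaluated by two raters.
    return {
        raters[0]: set_a, raters[1]: set_a,
        raters[2]: set_b, raters[3]: set_b,
    }

assignment = assign(range(50), ["r1", "r2", "r3", "r4"])
# Paired raters see identical sets; the two sets do not overlap.
assert set(assignment["r1"]) == set(assignment["r2"])
assert set(assignment["r3"]) == set(assignment["r4"])
assert not set(assignment["r1"]) & set(assignment["r3"])
```

With this design, every item receives exactly two judgements, which is what makes inter-rater agreement computable per item set.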
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. It focuses on evaluating MT systems rather than detailing the computational resources for its own analysis.
Software Dependencies | No | The paper mentions using "Appraise (Federmann, 2012)" and the "TrueSkill method adapted to MT evaluation (Sakaguchi, Post, & Van Durme, 2014) following its usage at WMT15, i.e., we run 1,000 iterations of the rankings recorded with Appraise followed by clustering (significance level α = 0.05)." It also provides a link to the WMT15 TrueSkill implementation: "https://github.com/mjpost/wmt15". While these tools are named, specific version numbers for these or other software dependencies are not provided.
Experiment Setup | Yes | We conduct a relative ranking experiment using one professional human (HA) and two machine translations (MT1 and MT2), considering the native Chinese part of the WMT 2017 Chinese–English test set (see Section 5.2 for details). The 299 sentences used in the experiments stem from 41 documents, randomly selected from all the documents in the test set originally written in Chinese, and are shown in their original order. Raters are shown one sentence at a time, and see the original Chinese source alongside the three translations. The previous and next source sentences are also shown, in order to provide the annotator with local inter-sentential context. Five raters (two experts and three non-experts) participated in the assessment. The ratings are elicited with Appraise (Federmann, 2012). We derive an overall score for each translation (HA, MT1, and MT2) based on the rankings. We use the TrueSkill method adapted to MT evaluation (Sakaguchi, Post, & Van Durme, 2014) following its usage at WMT15, i.e., we run 1,000 iterations of the rankings recorded with Appraise followed by clustering (significance level α = 0.05). In a pairwise ranking experiment, we show raters (i) isolated sentences and (ii) entire documents, asking them to choose the better (with ties allowed) from two translation outputs: one produced by a professional translator, the other by a machine translation system. [...] We use spam items for quality control (Kittur, Chi, & Suh, 2008).
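The paper derives overall system scores from pairwise rankings with TrueSkill. As a much-simplified stand-in for that pipeline, the sketch below tallies plain win rates over pairwise judgements with ties allowed (counted as half a win for each side); the actual setup instead runs 1,000 TrueSkill iterations followed by clustering at α = 0.05, and the judgements listed here are invented for illustration.

```python
from collections import defaultdict

def aggregate(judgements):
    """judgements: list of (system_a, system_b, outcome), where outcome is
    'a', 'b', or 'tie'. Returns each system's win rate in [0, 1]."""
    wins = defaultdict(float)
    comparisons = defaultdict(int)
    for a, b, outcome in judgements:
        comparisons[a] += 1
        comparisons[b] += 1
        if outcome == "a":
            wins[a] += 1.0
        elif outcome == "b":
            wins[b] += 1.0
        else:  # tie: half a win for each side
            wins[a] += 0.5
            wins[b] += 0.5
    return {s: wins[s] / comparisons[s] for s in comparisons}

# Invented judgements over the three translations compared in the paper.
judgements = [
    ("HA", "MT1", "a"),
    ("HA", "MT2", "tie"),
    ("MT1", "MT2", "b"),
    ("HA", "MT1", "a"),
]
scores = aggregate(judgements)
```

Unlike raw win rates, TrueSkill models each system's score as a distribution and updates it per comparison, which is why the WMT-style pipeline can additionally cluster systems into significance-ranked groups.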