A Set of Recommendations for Assessing Human–Machine Parity in Language Translation

Authors: Samuel Läubli, Sheila Castilho, Graham Neubig, Rico Sennrich, Qinlan Shen, Antonio Toral

JAIR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our paper investigates three aspects of human MT evaluation, with a special focus on assessing human machine parity: the choice of raters, the use of linguistic context, and the creation of reference translations. We focus on the data shared by Hassan et al. (2018), and empirically test to what extent changes in the evaluation design affect the outcome of the human evaluation. Based on our empirical findings, we formulate a set of recommendations for human MT evaluation in general, and assessing human machine parity in particular. All of our data are made publicly available for external validation and further analysis.
Researcher Affiliation | Academia | Samuel Läubli EMAIL Institute of Computational Linguistics, University of Zurich; Sheila Castilho EMAIL ADAPT Centre, Dublin City University; Graham Neubig EMAIL Language Technologies Institute, Carnegie Mellon University; Rico Sennrich EMAIL Institute of Computational Linguistics, University of Zurich; Qinlan Shen EMAIL Language Technologies Institute, Carnegie Mellon University; Antonio Toral EMAIL Center for Language and Cognition, University of Groningen
Pseudocode | No | The paper describes evaluation protocols and methodologies in natural language text and flowcharts, but does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | All of our data are made publicly available for external validation and further analysis. https://github.com/ZurichNLP/mt-parity-assessment-data. The paper explicitly states that *data* is made publicly available, not the source code for the methodology.
Open Datasets | Yes | All of our data are made publicly available for external validation and further analysis. https://github.com/ZurichNLP/mt-parity-assessment-data. We use English translations of the Chinese source texts in the WMT 2017 English–Chinese test set (Bojar et al., 2017) for all experiments presented in this article.
Dataset Splits | Yes | We conduct a relative ranking experiment using one professional human (HA) and two machine translations (MT1 and MT2), considering the native Chinese part of the WMT 2017 Chinese–English test set (see Section 5.2 for details). The 299 sentences used in the experiments stem from 41 documents, randomly selected from all the documents in the test set originally written in Chinese, and are shown in their original order. [...] In each condition, four raters evaluate 50 documents (plus 5 spam items) and 104 sentences (plus 16 spam items). We use two non-overlapping sets of documents and two non-overlapping sets of sentences, and each is evaluated by two raters.
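The rater assignment quoted above (two non-overlapping item sets, each judged by two of four raters) can be sketched as follows. This is an illustrative reconstruction, not the paper's actual code; the document IDs, rater names, and `assign` helper are invented for the example.

```python
import random

def assign(documents, raters, seed=0):
    """Split documents into two non-overlapping sets and give each set
    to a pair of raters, mirroring the design described in the paper."""
    assert len(raters) == 4
    docs = list(documents)
    random.Random(seed).shuffle(docs)  # random, but reproducible via seed
    half = len(docs) // 2
    set_a, set_b = docs[:half], docs[half:]
    # Each non-overlapping document set is evaluated by two raters.
    return {
        raters[0]: set_a, raters[1]: set_a,
        raters[2]: set_b, raters[3]: set_b,
    }

assignment = assign(range(50), ["r1", "r2", "r3", "r4"])
# Paired raters see identical sets; the two sets do not overlap.
assert set(assignment["r1"]) == set(assignment["r2"])
assert set(assignment["r3"]) == set(assignment["r4"])
assert not set(assignment["r1"]) & set(assignment["r3"])
```

With this design, every item receives exactly two judgements, which is what makes inter-rater agreement computable per item set.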
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. It focuses on evaluating MT systems rather than detailing the computational resources for its own analysis.
Software Dependencies | No | The paper mentions using "Appraise (Federmann, 2012)" and the "TrueSkill method adapted to MT evaluation (Sakaguchi, Post, & Van Durme, 2014) following its usage at WMT15, i.e., we run 1,000 iterations of the rankings recorded with Appraise followed by clustering (significance level α = 0.05)." It also provides a link to the WMT15 TrueSkill implementation: "https://github.com/mjpost/wmt15". While these tools are named, specific version numbers for these or other software dependencies are not provided.
Experiment Setup | Yes | We conduct a relative ranking experiment using one professional human (HA) and two machine translations (MT1 and MT2), considering the native Chinese part of the WMT 2017 Chinese–English test set (see Section 5.2 for details). The 299 sentences used in the experiments stem from 41 documents, randomly selected from all the documents in the test set originally written in Chinese, and are shown in their original order. Raters are shown one sentence at a time, and see the original Chinese source alongside the three translations. The previous and next source sentences are also shown, in order to provide the annotator with local inter-sentential context. Five raters (two experts and three non-experts) participated in the assessment. The ratings are elicited with Appraise (Federmann, 2012). We derive an overall score for each translation (HA, MT1, and MT2) based on the rankings. We use the TrueSkill method adapted to MT evaluation (Sakaguchi, Post, & Van Durme, 2014) following its usage at WMT15, i.e., we run 1,000 iterations of the rankings recorded with Appraise followed by clustering (significance level α = 0.05). In a pairwise ranking experiment, we show raters (i) isolated sentences and (ii) entire documents, asking them to choose the better (with ties allowed) from two translation outputs: one produced by a professional translator, the other by a machine translation system. [...] We use spam items for quality control (Kittur, Chi, & Suh, 2008).
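The paper derives overall system scores from pairwise rankings with TrueSkill. As a much-simplified stand-in for that pipeline, the sketch below tallies plain win rates over pairwise judgements with ties allowed (counted as half a win for each side); the actual setup instead runs 1,000 TrueSkill iterations followed by clustering at α = 0.05, and the judgements listed here are invented for illustration.

```python
from collections import defaultdict

def aggregate(judgements):
    """judgements: list of (system_a, system_b, outcome), where outcome is
    'a', 'b', or 'tie'. Returns each system's win rate in [0, 1]."""
    wins = defaultdict(float)
    comparisons = defaultdict(int)
    for a, b, outcome in judgements:
        comparisons[a] += 1
        comparisons[b] += 1
        if outcome == "a":
            wins[a] += 1.0
        elif outcome == "b":
            wins[b] += 1.0
        else:  # tie: half a win for each side
            wins[a] += 0.5
            wins[b] += 0.5
    return {s: wins[s] / comparisons[s] for s in comparisons}

# Invented judgements over the three translations compared in the paper.
judgements = [
    ("HA", "MT1", "a"),
    ("HA", "MT2", "tie"),
    ("MT1", "MT2", "b"),
    ("HA", "MT1", "a"),
]
scores = aggregate(judgements)
```

Unlike raw win rates, TrueSkill models each system's score as a distribution and updates it per comparison, which is why the WMT-style pipeline can additionally cluster systems into significance-ranked groups.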