Synthetic Treebanking for Cross-Lingual Dependency Parsing
Authors: Jörg Tiedemann, Željko Agić
JAIR 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present competitive results in cross-lingual dependency parsing using a combination of various techniques that contribute to the overall success of the method. We further include a detailed discussion about the impact of part-of-speech label accuracy on parsing results that provide guidance in practical applications of cross-lingual methods for truly under-resourced languages. ... In this section, we will discuss a series of experiments that systematically explore various cross-lingual parsing models based on annotation projection and treebank translation. Here, we only assess the properties of the specific approach, and we compare them intrinsically or to the baseline. |
| Researcher Affiliation | Academia | Jörg Tiedemann EMAIL Department of Modern Languages, University of Helsinki P.O. Box 24, FI-00014 University of Helsinki, Finland; Željko Agić EMAIL Center for Language Technology, University of Copenhagen Njalsgade 140, 2300 Copenhagen S, Denmark |
| Pseudocode | Yes | The procedure is summarized in the pseudo-code shown in Figure 6. ... Figure 6: Annotation projection without DUMMY nodes proposed by Tiedemann et al. (2014). |
| Open Source Code | No | The paper discusses the use of several third-party tools such as 'mate-tools', 'GIZA++', 'Moses', 'KenLM tools', 'HunPos', 'MaltParser', and 'MaltOptimizer' but does not explicitly state that the authors' own implementation code for the methodology described in this paper is publicly available or provide a link to it. |
| Open Datasets | Yes | In our setup, we always use the test sets provided by the Universal Dependency Treebank version 1 (UDT) (McDonald et al., 2013) with their cross-lingually harmonized annotation... In our experiments, we use Europarl (Koehn, 2005) for each language pair following the basic setup of Tiedemann (2014)... The SMT community typically experiments with the Europarl dataset (Koehn, 2005), while many other datasets are also freely available and cover many more languages, such as the OPUS collection (Tiedemann, 2012)... For tuning we use MERT (Och, 2003) and the newstest2011 data provided by the annual workshop on statistical machine translation (WMT). |
| Dataset Splits | Yes | In our setup, we always use the test sets provided by the Universal Dependency Treebank version 1 (UDT) (McDonald et al., 2013) with their cross-lingually harmonized annotation... The baseline model applies the DCA projection heuristics as presented by Hwa et al. (2005) and the first 40,000 sentences of each bitext in the corpus (repetitions of sentences included)... Discarding trees that include DUMMY nodes; results with 40,000 accepted trees... We experiment with improved annotation projection (see Section 3.2), and we introduce up to 60 thousand sentences with projected dependency trees. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. It mentions using various software tools for processing and training but no hardware. |
| Software Dependencies | No | The paper mentions several software tools like 'mate-tools (Bohnet, 2010)', 'GIZA++ (Och & Ney, 2003)', 'Moses (Koehn et al., 2007)', 'MERT (Och, 2003)', 'KenLM tools (Heafield et al., 2013)', 'HunPos (Halácsy et al., 2007)', 'MaltParser', and 'MaltOptimizer (Ballesteros & Nivre, 2012)'. However, it does not specify explicit version numbers for any of these tools. |
| Experiment Setup | Yes | Our MT setup is very generic and uses the Moses toolbox for training, tuning and decoding (Koehn et al., 2007). The translation models are trained on the entire Europarl corpus version 7 without language-pair-specific optimization. For tuning we use MERT (Och, 2003) and the newstest2011 data... The language model is a standard 5-gram model and is based on a combination of Europarl and News data provided from the same source. We apply modified Kneser-Ney smoothing without pruning, applying KenLM tools (Heafield et al., 2013) for estimating the LM parameters... restrict the number of non-terminals on the right-hand side of extracted rules to three. Furthermore, we allow consecutive non-terminals on the source side to increase coverage. |
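The annotation-projection idea assessed in the Pseudocode row can be illustrated with a minimal sketch: dependency heads are copied from source to target through one-to-one word-alignment links, in the spirit of the DCA heuristics of Hwa et al. (2005). The function name, data, and the handling of unaligned heads below are illustrative assumptions, not the procedure of the paper's Figure 6.

```python
def project_heads(src_heads, alignment):
    """Project dependency heads from a source sentence to a target sentence.

    src_heads: dict mapping source token index -> head index (0 = root)
    alignment: dict of one-to-one links, source index -> target index
    Returns a dict mapping target index -> projected head index,
    with None where the head cannot be projected (illustrative choice).
    """
    tgt_heads = {}
    for s, t in alignment.items():
        s_head = src_heads.get(s)
        if s_head == 0:              # source root projects to target root
            tgt_heads[t] = 0
        elif s_head in alignment:    # head token has an aligned counterpart
            tgt_heads[t] = alignment[s_head]
        else:                        # unaligned head: leave unresolved
            tgt_heads[t] = None
    return tgt_heads

# Example: 3-token source "she reads books"; "reads" (2) is the root,
# "she" (1) and "books" (3) attach to it; monotone 1-to-1 alignment.
src_heads = {1: 2, 2: 0, 3: 2}
alignment = {1: 1, 2: 2, 3: 3}
print(project_heads(src_heads, alignment))  # -> {1: 2, 2: 0, 3: 2}
```

The paper's contribution lies in what this sketch leaves out: the Figure 6 procedure additionally resolves many-to-one and unaligned cases without resorting to DUMMY nodes, which is why the Dataset Splits row distinguishes runs that discard trees containing DUMMY nodes.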