Automatic Wordnet Development for Low-Resource Languages using Cross-Lingual WSD

Authors: Nasrin Taghizadeh, Hesham Faili

JAIR 2016

Each reproducibility variable is listed below with its assessed result and the supporting LLM response.
Research Type: Experimental
"The proposed method has been executed with the Persian language and the resulting wordnet has been evaluated through several experiments. The results show that the induced wordnet has a precision score of 90% and a recall score of 35%."
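The paper reports precision and recall separately; the harmonic mean below is our own derived figure, shown only to make the precision/recall trade-off concrete, and is not a number stated in the paper.

```python
def f1(precision: float, recall: float) -> float:
    """Standard F1 score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Reported values: precision = 90%, recall = 35%.
print(round(f1(0.90, 0.35), 3))  # → 0.504
```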
Researcher Affiliation: Academia
"Nasrin Taghizadeh, School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran; Hesham Faili, School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran"
Pseudocode: No
"The paper describes the EM algorithm and its steps (Expectation Step, Maximization Step) verbally and with flowcharts (Figure 1 and Figure 2), but does not present a distinct block of pseudocode or a clearly labeled algorithm section with structured steps."
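Since the paper provides no pseudocode, the E/M loop it describes (iteratively re-estimating the probability of each candidate synset for each target-language word, ignoring links at or below a threshold t) can be sketched generically as follows. All names here (`candidates`, `score`, `em_sense_mapping`) are our own illustrative choices and the `score` callback merely stands in for the paper's WSD step; this is a sketch of the generic technique, not the authors' implementation.

```python
from collections import defaultdict

def em_sense_mapping(words, candidates, score, iterations=10, t=0.005):
    """Generic EM sketch for estimating P(synset | word).

    words      : iterable of target-language words
    candidates : dict word -> list of candidate WordNet synset ids
    score      : function (word, synset, probs) -> unnormalized evidence,
                 a hypothetical stand-in for the paper's WSD step
    Links whose current probability is at or below t are ignored,
    mirroring the thresholding the paper describes (t = 0.005).
    """
    # Initialize with a uniform distribution over each word's candidates.
    probs = {w: {s: 1.0 / len(candidates[w]) for s in candidates[w]}
             for w in words}
    for _ in range(iterations):
        # E-step: collect expected evidence for each (word, synset) link.
        expected = defaultdict(float)
        for w in words:
            for s, p in probs[w].items():
                if p > t:  # links at or below t are skipped
                    expected[(w, s)] += score(w, s, probs)
        # M-step: renormalize per word to get updated probabilities.
        for w in words:
            total = sum(expected[(w, s)] for s in probs[w])
            if total > 0:
                probs[w] = {s: expected[(w, s)] / total for s in probs[w]}
    return probs
```

With a toy scorer that favors one synset twice as strongly, the loop converges to a 2/3 vs. 1/3 split over the two candidates.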
Open Source Code Yes The source code is freely available for download at http://ece.ut.ac.ir/en/node/940
Open Datasets: Yes
"To construct a wordnet for the Persian language, the Bijankhan Persian corpus has been used. This collection has been gathered from daily news and common texts, in which all documents are categorized into different subjects such as political, cultural and so on. Bijankhan contains about ten million manually-tagged words with a tag set containing 550 fine-grained Persian POS tags (Oroumchian, Tasharofi, Amiri, Hojjat, & Raja, 2006). See http://ece.ut.ac.ir/dbrg/bijankhan/. Core WordNet (2015): http://wordnetcode.princeton.edu/standoff-files/core-wordnet.txt"
Dataset Splits: Yes
"In order to evaluate the behaviour of the proposed method when the corpus size is limited, a part of the Bijankhan corpus has been picked for training the Persian wordnet. Both the PMI-based and the graph-based methods have been conducted using this part, which includes nearly 13% of the total size of the corpus. The remaining 87% has been used in the testing phase, in which coverage of the wordnet over the corpus was evaluated."
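The split described above can be sketched as a simple partition of the corpus. The 13% figure comes from the paper; the function, its name, and the contiguous-split assumption are our own, since the paper does not say how the training portion was selected.

```python
def split_corpus(sentences, train_fraction=0.13):
    """Partition a corpus into a training part (~13%, as in the paper)
    and a held-out part (~87%) used to measure wordnet coverage.
    A contiguous split is assumed here purely for illustration."""
    cut = int(len(sentences) * train_fraction)
    return sentences[:cut], sentences[cut:]

train, heldout = split_corpus(list(range(100)))
print(len(train), len(heldout))  # → 13 87
```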
Hardware Specification: No
"The paper mentions running experiments but does not provide specific details about the hardware used (e.g., CPU, GPU models, memory specifications)."
Software Dependencies: No
"The paper mentions using WordNet version 3.0 and the STeP-1 tool, but it does not specify implementation software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions), nor multiple key software components with versions."
Experiment Setup: Yes
"In the WSD procedure, the context of each word is the sentence containing that word. A depth-first search in WSD has been performed up to a maximum depth of 3, similar to the work of Navigli and Ponzetto (2012b). As mentioned in Section 3.2, if the probability of the WordNet sense s given the word w is less than or equal to t, that sense is ignored in the WSD process of the EM algorithm. In our experiments, we have set t = 0.005. Also, in each iteration, those links with a current score below t are ignored, and the corresponding senses are not present in the graph's construction and the WSD procedure. At the end, those words in the target language that are mapped onto the same synset in WordNet make up the synsets of the resulting wordnet. After execution of the EM algorithm, the probability of assigning each candidate synset to each word in the target language is finalized. These probabilities are sorted, and those links with a probability under the threshold t_remove are removed from the final wordnet. The value of t_remove determines the size of the wordnet and affects its quality, so experiments were conducted using different values for t_remove: 0.1, 0.05, 0.02, 0.01, 0.005 and 0.0."
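The final pruning step described in this entry (dropping links whose probability falls under t_remove, which trades wordnet size against quality) can be sketched as follows. The data layout, the function name, and the sample Persian word are our own illustrative assumptions, not the paper's code; the candidate t_remove values are the ones the paper tested.

```python
def prune_wordnet(links, t_remove=0.01):
    """Keep only (word, synset, probability) links at or above t_remove,
    sorted by descending probability. Larger t_remove values yield a
    smaller, higher-precision wordnet; t_remove = 0.0 keeps every link.
    (Illustrative sketch; the paper tried t_remove in
    {0.1, 0.05, 0.02, 0.01, 0.005, 0.0}.)"""
    kept = [(w, s, p) for (w, s, p) in links if p >= t_remove]
    return sorted(kept, key=lambda link: -link[2])

# Hypothetical example: "ketab" (Persian for "book") with two candidates.
links = [("ketab", "book.n.01", 0.8), ("ketab", "script.n.01", 0.005)]
print(prune_wordnet(links, t_remove=0.01))  # → [('ketab', 'book.n.01', 0.8)]
```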