Bilingual Distributed Word Representations from Document-Aligned Comparable Data

Authors: Ivan Vulić, Marie-Francine Moens

JAIR 2016

Reproducibility assessment (variable, result, and supporting LLM response):
Research Type: Experimental. "We demonstrate the utility of the induced BWEs in two semantic tasks: (1) bilingual lexicon extraction, (2) suggesting word translations in context for polysemous words. Our simple yet effective BWE-based models significantly outperform the MuPTM-based and context-counting representation models from comparable data as well as prior BWE-based models, and acquire the best reported results on both tasks for all three tested language pairs."
Researcher Affiliation: Academia. "Ivan Vulić (EMAIL), University of Cambridge, Department of Theoretical and Applied Linguistics, 9 West Road, CB3 9DP, Cambridge, UK; Marie-Francine Moens (EMAIL), KU Leuven, Department of Computer Science, Celestijnenlaan 200A, 3001 Heverlee, Belgium"
Pseudocode: No. The paper describes methods and models using mathematical formulations and textual explanations, but it does not include a clearly labeled pseudocode or algorithm block.
Open Source Code: Yes. "We will make our pre-training and training code for BWESG publicly available, along with all BWESG-based bilingual word embeddings for the three language pairs at: http://liir.cs.kuleuven.be/software.php."
Open Datasets: Yes. "To induce bilingual word embeddings as well as to be directly comparable with baseline representations from prior work, we use a dataset comprising a subset of comparable Wikipedia data available in three language pairs (Vulić & Moens, 2013b, 2014)... Available online: people.cs.kuleuven.be/~ivan.vulic/software/ ... we use Europarl.v7 (Koehn, 2005) for all three language pairs obtained from the OPUS website (Tiedemann, 2012): http://opus.lingfil.uu.se/"
Dataset Splits: Yes. "Test Data: For each language pair, we evaluate on standard 1,000 ground truth one-to-one translation pairs built for the three language pairs (ES/IT/NL-EN) by Vulić and Moens (2013a, 2013b). Test Data: We use the SWTC test set introduced recently (Vulić & Moens, 2014). The test set comprises 15 polysemous nouns in three languages (ES, IT and NL) along with sets of their translation candidates (i.e., sets TC). For each polysemous noun, the test sets provide 24 sentences extracted from Wikipedia which illustrate different senses and translations of the pivot polysemous noun, accompanied by the annotated correct translation for each sentence. This yields 360 test sentences for each language pair (and 1,080 test sentences in total). An additional set of 100 IT sentences (5 other polysemous IT nouns plus 20 sentences for each noun) is used as a development set to tune the parameter λ (see Section 5.1) for all language pairs and all models in comparison."
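The test-set sizes quoted above are internally consistent, which can be checked with a few lines of arithmetic (a minimal sketch; the variable names are illustrative, not from the paper):

```python
# Reported SWTC test-set composition (Vulić & Moens, 2014)
nouns_per_language = 15      # polysemous pivot nouns per language (ES, IT, NL)
sentences_per_noun = 24      # Wikipedia sentences illustrating different senses
language_pairs = 3           # ES-EN, IT-EN, NL-EN

per_pair = nouns_per_language * sentences_per_noun   # test sentences per pair
total = per_pair * language_pairs                    # test sentences in total

# Development set: 5 extra IT nouns, 20 sentences each, used to tune lambda
dev_sentences = 5 * 20

print(per_pair, total, dev_sentences)  # 360 1080 100
```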
Hardware Specification: Yes. "Typically, several hours are needed to train BWESG with d = 300 and cs = 48-60, whereas it takes two to three days to train a bilingual topic model with K = 2000 on the same training set using the multi-threaded architectures on 10 Intel(R) Xeon(R) CPU E5-2667 2.90GHz processors."
Software Dependencies: No. The paper mentions software such as "SGNS from the word2vec package" and "TreeTagger (Schmid, 1994)", but does not provide specific version numbers for these tools to ensure reproducibility of the software environment.
Experiment Setup: Yes. "All parameters are set to the default suggested parameters for SGNS from the word2vec package: stochastic gradient descent (SGD) with a linearly decreasing global learning rate of 0.025, 25 negative samples, subsampling rate 1e-4, and 15 epochs. We have varied the number of dimensions d = 100, 200, 300. We have also trained BWESG with d = 40 to be directly comparable to readily available sets of BWEs from prior work (Chandar et al., 2014). Moreover, to test the effect of window size on the final results, i.e., the number of positives used for training, we have varied the maximum window size cs from 4 to 60 in steps of 4."
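The hyperparameter sweep described above can be enumerated as follows (a sketch only; the dictionary keys are illustrative labels, not word2vec command-line flags, and the paper's extra d = 40 run is noted but left out of the main grid):

```python
from itertools import product

dims = [100, 200, 300]                # embedding dimensionality d (a d = 40 run is also reported)
window_sizes = list(range(4, 61, 4))  # maximum window size cs: 4, 8, ..., 60

# Fixed SGNS settings quoted from the paper (word2vec defaults)
base = {
    "learning_rate": 0.025,  # linearly decreasing global SGD rate
    "negative": 25,          # negative samples
    "subsample": 1e-4,       # subsampling rate
    "epochs": 15,
}

# One configuration per (d, cs) combination
grid = [dict(base, d=d, cs=cs) for d, cs in product(dims, window_sizes)]
print(len(grid))  # 3 dims x 15 window sizes = 45 configurations
```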