News Across Languages - Cross-Lingual Document Similarity and Event Tracking
Authors: Jan Rupnik, Andrej Muhic, Gregor Leban, Primoz Skraba, Blaz Fortuna, Marko Grobelnik
JAIR 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide an extensive evaluation of the system as a whole, as well as an evaluation of the quality and robustness of the similarity measure and the linking algorithm. In Section 6, we present and interpret the experimental results. To evaluate the prediction accuracy for a given dataset we used 10-fold cross validation. The results of the trained models are shown in Table 4. |
| Researcher Affiliation | Academia | Artificial Intelligence Laboratory, Jožef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, Slovenia |
| Pseudocode | Yes | Algorithm 1: Algorithm for identifying candidate clusters C that are potentially equivalent to ci |
| Open Source Code | Yes | We have made both the code and data that were used in the experiments publicly available at https://github.com/rupnikj/jair_paper.git. The included archive contains two folders: positive and negative, where the first folder includes examples of cluster pairs in two languages that represent the same event and the second folder contains pairs of clusters in two languages that do not represent the same event. Each example is a JSON file that contains, at the top level, information about a pair of clusters (including the text of the articles) as well as a set of meta attributes that correspond to the features described in Section 5.2. The code folder includes MATLAB scripts for building the cross-lingual similarity models introduced in Section 4.2, which can be used with the publicly available Wikipedia corpus to reproduce the cross-lingual similarity evaluation. |
| Open Datasets | Yes | To investigate the empirical performance of the low-rank approximations we will test the algorithms on a large-scale, real-world multilingual dataset that we extracted from Wikipedia by using inter-language links for alignment. We have made both the code and data that were used in the experiments publicly available at https://github.com/rupnikj/jair_paper.git. The manually labelled dataset used in the evaluation of event linking is available in the dataset subfolder of the github repository. |
| Dataset Splits | Yes | The evaluation is based on splitting the data into training and test sets. We select the test set documents as all multilingual documents with at least one nonempty alignment from the list: (hi, ht), (hi, pms), (war, ht), (war, pms). The remaining documents are used for training. We computed the Average (over language pairs) Mean Reciprocal Rank (AMRR) (Voorhees et al., 1999) performance of the different approaches on the Wikipedia data by holding out 15,000 aligned test documents and using 300,000 aligned documents as the training set. To evaluate the prediction accuracy for a given dataset we used 10-fold cross validation. |
| Hardware Specification | Yes | The similarity pipeline is the most computationally intensive part and currently runs on a machine with two Intel Xeon E5-2667 v2, 3.30GHz processors with 256GB of RAM. |
| Software Dependencies | No | For all linear algebra matrix and vector operations, we use high-performance numerical linear algebra libraries such as BLAS, OpenBLAS and Intel MKL, which currently allows us to process more than one million articles per day. |
| Experiment Setup | Yes | For clustering, each new article is first tokenized, stop words are removed and the remaining words are stemmed. The remaining tokens are represented in a vector-space model and normalized using TF-IDF (see Section 4.1 for the definition). Cosine similarity is used to find the most similar existing cluster, by comparing the document's vector to the centroid vector of each cluster. A user-defined threshold is used to determine if the article is not similar enough to any existing clusters (0.4 was used in our experiments). The classification algorithm that we used to train a model was a linear Support Vector Machine (SVM) method (Shawe-Taylor & Cristianini, 2004). We note that taking k = 500 or k = 1,000 multilingual topics usually results in similar performance. |
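The incremental clustering quoted in the Experiment Setup row (TF-IDF vectors, cosine similarity against each cluster's centroid, a 0.4 similarity threshold for joining a cluster) can be sketched as below. This is a minimal illustration under our own assumptions, not the authors' pipeline: tokenization, stop-word removal, and stemming are omitted, and the function names and toy corpus are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    # cosine similarity between two sparse vectors stored as dicts
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def tfidf_vectors(docs):
    # docs: list of token lists; returns one TF-IDF dict per document
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vecs

def cluster(vectors, threshold=0.4):
    # assign each document to the most similar centroid,
    # or open a new cluster if no similarity reaches the threshold
    centroids, assignments = [], []
    for v in vectors:
        sims = [cosine(v, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is not None and sims[best] >= threshold:
            assignments.append(best)
            # keep the centroid as a running sum of member vectors;
            # cosine similarity is scale-invariant, so no renormalization needed
            for t, w in v.items():
                centroids[best][t] = centroids[best].get(t, 0.0) + w
        else:
            assignments.append(len(centroids))
            centroids.append(dict(v))
    return assignments
```

In the paper's setting each cluster centroid is compared against the incoming article's vector; the running-sum centroid above is one simple way to realize that.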
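The AMRR metric quoted in the Dataset Splits row averages Mean Reciprocal Rank over language pairs. A sketch of the computation, assuming each query contributes the 1-based rank at which its aligned mate document is retrieved (the function names are ours, not from the released code):

```python
def mean_reciprocal_rank(ranks):
    # ranks: for each query, the 1-based rank of the correct mate document
    return sum(1.0 / r for r in ranks) / len(ranks)

def amrr(per_pair_ranks):
    # AMRR: average the per-language-pair MRR values,
    # where each inner list holds the mate ranks for one language pair
    return sum(mean_reciprocal_rank(r) for r in per_pair_ranks) / len(per_pair_ranks)
```

An MRR of 1.0 for a language pair would mean every held-out test document retrieved its aligned counterpart at rank 1.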
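The 10-fold cross validation mentioned in both the Research Type and Dataset Splits rows can be sketched generically. The paper does not specify how folds are constructed, so the contiguous, unshuffled folds below are an assumption for illustration only:

```python
def kfold_indices(n, k=10):
    # split indices 0..n-1 into k near-equal contiguous folds;
    # each fold serves once as the test set, the rest as training
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    splits, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        splits.append((train, test))
        start += size
    return splits
```

Prediction accuracy is then averaged over the k test folds.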