News Across Languages - Cross-Lingual Document Similarity and Event Tracking
Authors: Jan Rupnik, Andrej Muhic, Gregor Leban, Primoz Skraba, Blaz Fortuna, Marko Grobelnik
JAIR 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide an extensive evaluation of the system as a whole, as well as an evaluation of the quality and robustness of the similarity measure and the linking algorithm. In Section 6, we present and interpret the experimental results. To evaluate the prediction accuracy for a given dataset we used 10-fold cross validation. The results of the trained models are shown in Table 4. |
| Researcher Affiliation | Academia | Artificial Intelligence Laboratory, Jožef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, Slovenia |
| Pseudocode | Yes | Algorithm 1: Algorithm for identifying candidate clusters C that are potentially equivalent to ci |
| Open Source Code | Yes | We have made both the code and data that were used in the experiments publicly available at https://github.com/rupnikj/jair_paper.git. The included archive contains two folders: positive and negative, where the first folder includes examples of cluster pairs in two languages that represent the same event and the second folder contains pairs of clusters in two languages that do not represent the same event. Each example is a JSON file that contains, at the top level, information about a pair of clusters (including the text of the articles) as well as a set of meta attributes that correspond to the features described in Section 5.2. The code folder includes MATLAB scripts for building the cross-lingual similarity models introduced in Section 4.2, which can be used with the publicly available Wikipedia corpus to reproduce the cross-lingual similarity evaluation. |
| Open Datasets | Yes | To investigate the empirical performance of the low-rank approximations we will test the algorithms on a large-scale, real-world multilingual dataset that we extracted from Wikipedia by using inter-language links for alignment. We have made both the code and data that were used in the experiments publicly available at https://github.com/rupnikj/jair_paper.git. The manually labelled dataset used in the evaluation of event linking is available in the dataset subfolder of the github repository. |
| Dataset Splits | Yes | The evaluation is based on splitting the data into training and test sets. We select the test set documents as all multilingual documents with at least one nonempty alignment from the list: (hi, ht), (hi, pms), (war, ht), (war, pms). The remaining documents are used for training. We computed the Average (over language pairs) Mean Reciprocal Rank (AMRR) (Voorhees et al., 1999) performance of the different approaches on the Wikipedia data by holding out 15,000 aligned test documents and using 300,000 aligned documents as the training set. To evaluate the prediction accuracy for a given dataset we used 10-fold cross validation. |
| Hardware Specification | Yes | The similarity pipeline is the most computationally intensive part and currently runs on a machine with two Intel Xeon E5-2667 v2, 3.30GHz processors with 256GB of RAM. |
| Software Dependencies | No | For all linear algebra matrix and vector operations, we use high-performance numerical linear algebra libraries such as BLAS, OpenBLAS and Intel MKL, which currently allows us to process more than one million articles per day. |
| Experiment Setup | Yes | For clustering, each new article is first tokenized, stop words are removed and the remaining words are stemmed. The remaining tokens are represented in a vector-space model and normalized using TF-IDF (see Section 4.1 for the definition). Cosine similarity is used to find the most similar existing cluster, by comparing the document's vector to the centroid vector of each cluster. A user-defined threshold is used to determine if the article is not similar enough to any existing clusters (0.4 was used in our experiments). The classification algorithm that we used to train a model was a linear Support Vector Machine (SVM) method (Shawe-Taylor & Cristianini, 2004). We note that taking k = 500 or k = 1,000 multilingual topics usually results in similar performance. |
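The incremental clustering quoted in the Experiment Setup row (TF-IDF vectors, cosine similarity against each cluster's centroid, a 0.4 similarity threshold for joining a cluster) can be sketched as below. This is a minimal illustration under our own assumptions, not the authors' pipeline: tokenization, stop-word removal, and stemming are omitted, and the function names and toy corpus are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    # cosine similarity between two sparse vectors stored as dicts
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def tfidf_vectors(docs):
    # docs: list of token lists; returns one TF-IDF dict per document
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vecs

def cluster(vectors, threshold=0.4):
    # assign each document to the most similar centroid,
    # or open a new cluster if no similarity reaches the threshold
    centroids, assignments = [], []
    for v in vectors:
        sims = [cosine(v, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is not None and sims[best] >= threshold:
            assignments.append(best)
            # keep the centroid as a running sum of member vectors;
            # cosine similarity is scale-invariant, so no renormalization needed
            for t, w in v.items():
                centroids[best][t] = centroids[best].get(t, 0.0) + w
        else:
            assignments.append(len(centroids))
            centroids.append(dict(v))
    return assignments
```

In the paper's setting each cluster centroid is compared against the incoming article's vector; the running-sum centroid above is one simple way to realize that.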
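The AMRR metric quoted in the Dataset Splits row averages Mean Reciprocal Rank over language pairs. A sketch of the computation, assuming each query contributes the 1-based rank at which its aligned mate document is retrieved (the function names are ours, not from the released code):

```python
def mean_reciprocal_rank(ranks):
    # ranks: for each query, the 1-based rank of the correct mate document
    return sum(1.0 / r for r in ranks) / len(ranks)

def amrr(per_pair_ranks):
    # AMRR: average the per-language-pair MRR values,
    # where each inner list holds the mate ranks for one language pair
    return sum(mean_reciprocal_rank(r) for r in per_pair_ranks) / len(per_pair_ranks)
```

An MRR of 1.0 for a language pair would mean every held-out test document retrieved its aligned counterpart at rank 1.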
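The 10-fold cross validation mentioned in both the Research Type and Dataset Splits rows can be sketched generically. The paper does not specify how folds are constructed, so the contiguous, unshuffled folds below are an assumption for illustration only:

```python
def kfold_indices(n, k=10):
    # split indices 0..n-1 into k near-equal contiguous folds;
    # each fold serves once as the test set, the rest as training
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    splits, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        splits.append((train, test))
        start += size
    return splits
```

Prediction accuracy is then averaged over the k test folds.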