Lightweight Random Indexing for Polylingual Text Classification

Authors: Alejandro Moreo Fernández, Andrea Esuli, Fabrizio Sebastiani

JAIR 2016

Reproducibility assessment — each variable, its result, and the supporting response:
Research Type: Experimental. "By running experiments on two well known public benchmarks, Reuters RCV1/RCV2 (a comparable corpus) and JRC-Acquis (a parallel one), we show LRI to outperform (both in terms of effectiveness and efficiency) a number of previously proposed machine-translation-free and dictionary-free PLTC methods that we use as baselines."
Researcher Affiliation: Academia. Alejandro Moreo Fernández and Andrea Esuli, Istituto di Scienza e Tecnologie dell'Informazione, Consiglio Nazionale delle Ricerche, 56124 Pisa, IT; Fabrizio Sebastiani, Qatar Computing Research Institute, Hamad bin Khalifa University, PO Box 5825, Doha, QA.
Pseudocode: Yes. "Algorithm 1: Feature Dictionary for Lightweight Random Indexing."
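For readers unfamiliar with the technique, here is a minimal sketch of the Random Indexing idea that Algorithm 1 builds on: each term receives a sparse random index vector with exactly k nonzero ±1 components (LRI fixes k = 2), and a document is represented as the term-frequency-weighted sum of its terms' index vectors. The function names and the dimensionality below are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def make_index_vector(dim, k, rng):
    """Sparse ternary index vector: k nonzero entries, half +1 and half -1.
    With k = 2 (the LRI configuration) that is one +1 and one -1."""
    v = np.zeros(dim)
    pos = rng.choice(dim, size=k, replace=False)
    v[pos[: k // 2]] = 1.0
    v[pos[k // 2 :]] = -1.0
    return v

def project_documents(docs, dim=500, k=2, seed=0):
    """Map bag-of-words documents (term -> frequency dicts) into a shared
    dim-dimensional space by summing their terms' random index vectors.
    The term -> vector dictionary is built lazily and shared by all docs."""
    rng = np.random.default_rng(seed)
    index = {}  # feature dictionary: one index vector per term
    rows = []
    for doc in docs:
        v = np.zeros(dim)
        for term, tf in doc.items():
            if term not in index:
                index[term] = make_index_vector(dim, k, rng)
            v += tf * index[term]
        rows.append(v)
    return np.vstack(rows)
```

Since a single shared dictionary maps the terms of all languages into one space, a single classifier can then be trained on the union of the monolingual training sets, which is the point of the polylingual setting the paper targets.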
Open Source Code: Yes. "We have implemented the LRI method and the other baseline methods as part of the Esuli, Fagni, and Moreo (2016) framework. We have used Support Vector Machines (SVMs) as the learning device in all cases, since it has consistently delivered state-of-the-art results in TC so far; for it we used the well-known Joachims (2009) implementation of Joachims (2005), with default parameters. ... The source code we used in our experiments is accessible as part of the Esuli et al. (2016) framework."
Open Datasets: Yes. "RCV1 is a publicly available collection consisting of the 804,414 English news stories generated by Reuters from 20 Aug 1996 to 19 Aug 1997 (Lewis, Yang, Rose, & Li, 2004). ... JRC-Acquis (version 3.0) is a version of the Acquis Communautaire collection of parallel legislative texts from European Union law written between the 1950s and 2006 (Steinberger, Pouliquen, Widiger, Ignat, Erjavec, Tufis, & Varga, 2006). JRC-Acquis is publicly available for research purposes."
Dataset Splits: Yes. "From RCV1/RCV2 we randomly selected 8,000 news stories for 5 languages (English, Italian, Spanish, French, German) pertaining to the last 4 months (from 1997-04-19 to 1997-08-19), and we performed a 70%/30% train/test split, thus obtaining a training set of 28,000 documents (5,600 for each language) and a test set of 12,000 documents (2,400 for each language). ... We have selected the 7,235 texts from 2006 for 5 languages (English, Italian, Spanish, French, and German) and removed documents without labels, thus obtaining 6,980 documents per language. We have taken the first 70% documents for training (24,430, i.e., 4,886 for each language) and the remaining 30% (10,470, i.e., 2,094 for each language) for testing."
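As a quick sanity check, the reported counts for both corpora are consistent with an exact per-language 70%/30% split. The helper below is illustrative (not from the paper's code) and uses integer arithmetic to avoid floating-point rounding.

```python
def split_counts(per_lang: int, langs: int) -> tuple[int, int]:
    """Total train/test document counts for a 70%/30% per-language split."""
    train_per_lang = per_lang * 7 // 10   # exactly 70%, integer arithmetic
    test_per_lang = per_lang - train_per_lang
    return train_per_lang * langs, test_per_lang * langs
```

With 8,000 documents per language over 5 languages this yields (28,000, 12,000) for RCV1/RCV2, and with 6,980 per language it yields (24,430, 10,470) for JRC-Acquis, matching the figures quoted above.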
Hardware Specification: Yes. "All the experiments were run on an Intel i7 64bit processor with 12 cores, running at 1,600MHz, and 24GBs RAM memory."
Software Dependencies: No. The paper mentions several third-party tools and implementations, such as the Rohde (2011) package for SVD, the Haddow, Hoang, Bertoldi, Bojar, and Heafield (2016) implementation for statistical machine translation, the Richardson (2008) implementation for PLDA, and the Joachims (2009) implementation of Joachims (2005) for SVMs. However, it does not give specific version numbers for these software packages (e.g., "PyTorch 1.9" or "CPLEX 12.4"), only references to papers or years of implementation.
Experiment Setup: Yes. "Once n is fixed, a recommended choice of k in the literature is k = n/100. We dub this configuration RI1%. ... we propose the use of Random Indexing with a fixed k = 2; we dub this configuration Lightweight Random Indexing (LRI). ... For PLDA we have used the Richardson (2008) implementation, which uses Gibbs sampling; we adhere to the common practice of fixing the budget of iterations to 1,000. ... We have used Support Vector Machines (SVMs) as the learning device in all cases ... for it we used the well-known Joachims (2009) implementation of Joachims (2005), with default parameters."
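To see why the fixed k = 2 remains viable against the literature's k = n/100, one can check empirically that sparse ternary index vectors stay nearly orthogonal in both configurations. The sketch below is illustrative only; the function name, vocabulary size, and dimensionalities are assumptions, not values from the paper.

```python
import numpy as np

def avg_abs_cosine(dim, k, terms=200, seed=0):
    """Average |cosine| between distinct random index vectors:
    a proxy for how close to pairwise-orthogonal the term vectors are.
    Each vector has k nonzero entries (half +1, half -1), so norm sqrt(k)."""
    rng = np.random.default_rng(seed)
    V = np.zeros((terms, dim))
    for i in range(terms):
        pos = rng.choice(dim, size=k, replace=False)
        V[i, pos[: k // 2]] = 1.0
        V[i, pos[k // 2 :]] = -1.0
    C = (V @ V.T) / k                     # cosine similarities
    off = C[~np.eye(terms, dtype=bool)]   # off-diagonal pairs only
    return float(np.abs(off).mean())
```

Both `avg_abs_cosine(dim, dim // 100)` (RI1%-style) and `avg_abs_cosine(dim, 2)` (LRI-style) come out close to zero for realistic dimensionalities, while LRI stores only two nonzero components per term vector, which is consistent with the efficiency advantage the paper reports.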