Lightweight Random Indexing for Polylingual Text Classification

Authors: Alejandro Moreo Fernández, Andrea Esuli, Fabrizio Sebastiani

JAIR 2016

Reproducibility assessment — each variable, its result, and the supporting response:
Research Type: Experimental. "By running experiments on two well known public benchmarks, Reuters RCV1/RCV2 (a comparable corpus) and JRC-Acquis (a parallel one), we show LRI to outperform (both in terms of effectiveness and efficiency) a number of previously proposed machine-translation-free and dictionary-free PLTC methods that we use as baselines."
Researcher Affiliation: Academia. Alejandro Moreo Fernández and Andrea Esuli, Istituto di Scienza e Tecnologie dell'Informazione, Consiglio Nazionale delle Ricerche, 56124 Pisa, IT; Fabrizio Sebastiani, Qatar Computing Research Institute, Hamad bin Khalifa University, PO Box 5825, Doha, QA.
Pseudocode: Yes. "Algorithm 1: Feature Dictionary for Lightweight Random Indexing."
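For readers unfamiliar with the technique, here is a minimal sketch of the Random Indexing idea that Algorithm 1 builds on: each term receives a sparse random index vector with exactly k nonzero ±1 components (LRI fixes k = 2), and a document is represented as the term-frequency-weighted sum of its terms' index vectors. The function names and the dimensionality below are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def make_index_vector(dim, k, rng):
    """Sparse ternary index vector: k nonzero entries, half +1 and half -1.
    With k = 2 (the LRI configuration) that is one +1 and one -1."""
    v = np.zeros(dim)
    pos = rng.choice(dim, size=k, replace=False)
    v[pos[: k // 2]] = 1.0
    v[pos[k // 2 :]] = -1.0
    return v

def project_documents(docs, dim=500, k=2, seed=0):
    """Map bag-of-words documents (term -> frequency dicts) into a shared
    dim-dimensional space by summing their terms' random index vectors.
    The term -> vector dictionary is built lazily and shared by all docs."""
    rng = np.random.default_rng(seed)
    index = {}  # feature dictionary: one index vector per term
    rows = []
    for doc in docs:
        v = np.zeros(dim)
        for term, tf in doc.items():
            if term not in index:
                index[term] = make_index_vector(dim, k, rng)
            v += tf * index[term]
        rows.append(v)
    return np.vstack(rows)
```

Since a single shared dictionary maps the terms of all languages into one space, a single classifier can then be trained on the union of the monolingual training sets, which is the point of the polylingual setting the paper targets.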
Open Source Code: Yes. "We have implemented the LRI method and the other baseline methods as part of the Esuli, Fagni, and Moreo (2016) framework. We have used Support Vector Machines (SVMs) as the learning device in all cases, since it has consistently delivered state-of-the-art results in TC so far; for it we used the well-known Joachims (2009) implementation of Joachims (2005), with default parameters. ... The source code we used in our experiments is accessible as part of the Esuli et al. (2016) framework."
Open Datasets: Yes. "RCV1 is a publicly available collection consisting of the 804,414 English news stories generated by Reuters from 20 Aug 1996 to 19 Aug 1997 (Lewis, Yang, Rose, & Li, 2004). ... JRC-Acquis (version 3.0) is a version of the Acquis Communautaire collection of parallel legislative texts from European Union law written between the 1950s and 2006 (Steinberger, Pouliquen, Widiger, Ignat, Erjavec, Tufis, & Varga, 2006). JRC-Acquis is publicly available for research purposes."
Dataset Splits: Yes. "From RCV1/RCV2 we randomly selected 8,000 news stories for 5 languages (English, Italian, Spanish, French, German) pertaining to the last 4 months (from 1997-04-19 to 1997-08-19), and we performed a 70%/30% train/test split, thus obtaining a training set of 28,000 documents (5,600 for each language) and a test set of 12,000 documents (2,400 for each language). ... We have selected the 7,235 texts from 2006 for 5 languages (English, Italian, Spanish, French, and German) and removed documents without labels, thus obtaining 6,980 documents per language. We have taken the first 70% documents for training (24,430, i.e., 4,886 for each language) and the remaining 30% (10,470, i.e., 2,094 for each language) for testing."
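As a quick sanity check, the reported counts for both corpora are consistent with an exact per-language 70%/30% split. The helper below is illustrative (not from the paper's code) and uses integer arithmetic to avoid floating-point rounding.

```python
def split_counts(per_lang: int, langs: int) -> tuple[int, int]:
    """Total train/test document counts for a 70%/30% per-language split."""
    train_per_lang = per_lang * 7 // 10   # exactly 70%, integer arithmetic
    test_per_lang = per_lang - train_per_lang
    return train_per_lang * langs, test_per_lang * langs
```

With 8,000 documents per language over 5 languages this yields (28,000, 12,000) for RCV1/RCV2, and with 6,980 per language it yields (24,430, 10,470) for JRC-Acquis, matching the figures quoted above.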
Hardware Specification: Yes. "All the experiments were run on an Intel i7 64bit processor with 12 cores, running at 1,600MHz, and 24GBs RAM memory."
Software Dependencies: No. The paper mentions several third-party tools and implementations, such as the Rohde (2011) package for SVD, the Haddow, Hoang, Bertoldi, Bojar, and Heafield (2016) implementation for statistical machine translation, the Richardson (2008) implementation for PLDA, and the Joachims (2009) implementation of Joachims (2005) for SVMs. However, it does not give specific version numbers for these software packages (e.g., "PyTorch 1.9" or "CPLEX 12.4"), only references to papers or years of implementation.
Experiment Setup: Yes. "Once n is fixed, a recommended choice of k in the literature is k = n/100. We dub this configuration RI1%. ... we propose the use of Random Indexing with a fixed k = 2; we dub this configuration Lightweight Random Indexing (LRI). ... For PLDA we have used the Richardson (2008) implementation, which uses Gibbs sampling; we adhere to the common practice of fixing the budget of iterations to 1,000. ... We have used Support Vector Machines (SVMs) as the learning device in all cases ... for it we used the well-known Joachims (2009) implementation of Joachims (2005), with default parameters."
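To see why the fixed k = 2 remains viable against the literature's k = n/100, one can check empirically that sparse ternary index vectors stay nearly orthogonal in both configurations. The sketch below is illustrative only; the function name, vocabulary size, and dimensionalities are assumptions, not values from the paper.

```python
import numpy as np

def avg_abs_cosine(dim, k, terms=200, seed=0):
    """Average |cosine| between distinct random index vectors:
    a proxy for how close to pairwise-orthogonal the term vectors are.
    Each vector has k nonzero entries (half +1, half -1), so norm sqrt(k)."""
    rng = np.random.default_rng(seed)
    V = np.zeros((terms, dim))
    for i in range(terms):
        pos = rng.choice(dim, size=k, replace=False)
        V[i, pos[: k // 2]] = 1.0
        V[i, pos[k // 2 :]] = -1.0
    C = (V @ V.T) / k                     # cosine similarities
    off = C[~np.eye(terms, dtype=bool)]   # off-diagonal pairs only
    return float(np.abs(off).mean())
```

Both `avg_abs_cosine(dim, dim // 100)` (RI1%-style) and `avg_abs_cosine(dim, 2)` (LRI-style) come out close to zero for realistic dimensionalities, while LRI stores only two nonzero components per term vector, which is consistent with the efficiency advantage the paper reports.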