Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Further results on latent discourse models and word embeddings
Authors: Sammy Khalife, Douglas Gonçalves, Youssef Allouah, Leo Liberti
JMLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Second, we empirically investigate the relation between the geometry of word vectors and PMI. Although the experiments reported in (Arora et al., 2016, Section 5) support the concentration phenomenon and a linear correlation between squared norms of word vectors and word frequencies (at least, for high frequency words), little is said about the relationship between PMI of word pairs and the scalar product of their word vectors. In this work, we perform a thorough empirical investigation of this relationship. Our extensive experiments strongly support the claim that theoretical relations derived from the considered generative model occur at best in some regimes of the co-occurrence terms. |
| Researcher Affiliation | Academia | Sammy Khalife, LIX, CNRS, Ecole Polytechnique, Institut Polytechnique de Paris, 91128 Palaiseau, France; Douglas Gonçalves, MTM/CFM, Universidade Federal de Santa Catarina, 88040-900 Florianópolis, Brazil; Youssef Allouah, Ecole Polytechnique, Institut Polytechnique de Paris, 91128 Palaiseau, France; Leo Liberti, LIX, CNRS, Ecole Polytechnique, Institut Polytechnique de Paris, 91128 Palaiseau, France |
| Pseudocode | No | The paper describes methods and proofs using mathematical notation and narrative text, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The SN word embeddings were reproduced using code available at (Arora et al., 2018a), which tries to solve (28) using AdaGrad (Duchi et al., 2011) with initial learning rate 0.05 and 25 training epochs. |
| Open Datasets | Yes | The English Wikipedia was used to train the SN word embeddings. The corpus was preprocessed using the standard approach (non-textual elements removed, sentences split, tokenized). Only words appearing more than 1000 times are considered. Three different extracts from the English Wikipedia dump were used. The first corpus (denoted corpus 1) consists of the first 1 million documents of the 2016 Wikipedia dump, deprived of prepositions and pronouns. The second corpus and third corpus (denoted corpus 2 and corpus 3 respectively) consist of the first 1,072,907 and 3,170,407 documents, respectively, of the 2020 Wikipedia dump. |
| Dataset Splits | No | The paper mentions using specific corpora (Wikipedia dumps) and preprocessing steps, but does not provide details about dataset splits (e.g., training, validation, test sets with percentages or sample counts) for the empirical verification experiments. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU models, memory specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions 'AdaGrad (Duchi et al., 2011)' as an optimization algorithm and 'GloVe' and 'word2vec' as word embedding methods. However, no specific version numbers are provided for these or any other software dependencies. |
| Experiment Setup | Yes | The SN word embeddings were reproduced using code available at (Arora et al., 2018a), which tries to solve (28) using AdaGrad (Duchi et al., 2011) with initial learning rate 0.05 and 25 training epochs. The corpus was preprocessed using the standard approach (non-textual elements removed, sentences split, tokenized). Only words appearing more than 1000 times are considered. |
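The central quantity in the paper's empirical investigation is the PMI of word pairs, compared against the scalar product of their word vectors. As a minimal sketch of that quantity only (a toy example, not the authors' pipeline or their Wikipedia-scale corpora), the PMI matrix can be computed directly from a symmetric co-occurrence count matrix:

```python
import numpy as np

def pmi_matrix(X):
    """PMI(w, c) = log( p(w, c) / (p(w) * p(c)) ) from raw co-occurrence
    counts X, where X[i, j] counts co-occurrences of words i and j.
    Undefined entries (zero counts) are set to 0 by convention."""
    X = np.asarray(X, dtype=float)
    total = X.sum()
    p_joint = X / total                      # joint probability p(w, c)
    p_word = X.sum(axis=1) / total           # marginal probability p(w)
    with np.errstate(divide="ignore"):
        pmi = np.log(p_joint / np.outer(p_word, p_word))
    pmi[X == 0] = 0.0
    return pmi

# Tiny symmetric co-occurrence matrix for 3 hypothetical words.
X = np.array([[0, 8, 2],
              [8, 0, 4],
              [2, 4, 0]])
P = pmi_matrix(X)
print(np.round(P, 3))
```

Under the generative model of Arora et al. (2016) that the paper examines, PMI(w, c) is predicted to be approximately proportional to the scalar product of the corresponding word vectors; the paper's experiments test in which co-occurrence regimes this relation actually holds.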