Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Further results on latent discourse models and word embeddings
Authors: Sammy Khalife, Douglas Gonçalves, Youssef Allouah, Leo Liberti
JMLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Second, we empirically investigate the relation between the geometry of word vectors and PMI. Although the experiments reported in (Arora et al., 2016, Section 5) support the concentration phenomenon and a linear correlation between squared norms of word vectors and word frequencies (at least, for high frequency words), little is said about the relationship between PMI of word pairs and the scalar product of their word vectors. In this work, we perform a thorough empirical investigation of this relationship. Our extensive experiments strongly support the claim that theoretical relations derived from the considered generative model occur at best in some regimes of the co-occurrence terms. |
| Researcher Affiliation | Academia | Sammy Khalife, LIX, CNRS, Ecole Polytechnique, Institut Polytechnique de Paris, 91128 Palaiseau, France; Douglas Gonçalves, MTM/CFM, Universidade Federal de Santa Catarina, 88040-900 Florianópolis, Brazil; Youssef Allouah, Ecole Polytechnique, Institut Polytechnique de Paris, 91128 Palaiseau, France; Leo Liberti, LIX, CNRS, Ecole Polytechnique, Institut Polytechnique de Paris, 91128 Palaiseau, France |
| Pseudocode | No | The paper describes methods and proofs using mathematical notation and narrative text, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The SN word embeddings were reproduced using code available at (Arora et al., 2018a), which tries to solve (28) using AdaGrad (Duchi et al., 2011) with initial learning rate 0.05 and 25 training epochs. |
| Open Datasets | Yes | The English Wikipedia was used to train the SN word embeddings. The corpus was preprocessed using the standard approach (non-textual elements removed, sentences split, tokenized). Only words appearing more than 1000 times are considered. Three different extracts from the English Wikipedia dump were used. The first corpus (denoted corpus 1) consists of the first 1 million documents of the 2016 Wikipedia dump, deprived of prepositions and pronouns. The second corpus and third corpus (denoted corpus 2 and corpus 3 respectively) consist of the first 1,072,907 and 3,170,407 documents, respectively, of the 2020 Wikipedia dump. |
| Dataset Splits | No | The paper mentions using specific corpora (Wikipedia dumps) and preprocessing steps, but does not provide details about dataset splits (e.g., training, validation, test sets with percentages or sample counts) for the empirical verification experiments. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU models, memory specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions 'AdaGrad (Duchi et al., 2011)' as an optimization algorithm and 'GloVe' and 'word2vec' as word embedding methods. However, no specific version numbers are provided for these or any other software dependencies. |
| Experiment Setup | Yes | The SN word embeddings were reproduced using code available at (Arora et al., 2018a), which tries to solve (28) using AdaGrad (Duchi et al., 2011) with initial learning rate 0.05 and 25 training epochs. The corpus was preprocessed using the standard approach (non-textual elements removed, sentences split, tokenized). Only words appearing more than 1000 times are considered. |
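The central quantity in the paper's empirical investigation is the PMI of word pairs, compared against the scalar product of their word vectors. As a minimal sketch of that quantity only (a toy example, not the authors' pipeline or their Wikipedia-scale corpora), the PMI matrix can be computed directly from a symmetric co-occurrence count matrix:

```python
import numpy as np

def pmi_matrix(X):
    """PMI(w, c) = log( p(w, c) / (p(w) * p(c)) ) from raw co-occurrence
    counts X, where X[i, j] counts co-occurrences of words i and j.
    Undefined entries (zero counts) are set to 0 by convention."""
    X = np.asarray(X, dtype=float)
    total = X.sum()
    p_joint = X / total                      # joint probability p(w, c)
    p_word = X.sum(axis=1) / total           # marginal probability p(w)
    with np.errstate(divide="ignore"):
        pmi = np.log(p_joint / np.outer(p_word, p_word))
    pmi[X == 0] = 0.0
    return pmi

# Tiny symmetric co-occurrence matrix for 3 hypothetical words.
X = np.array([[0, 8, 2],
              [8, 0, 4],
              [2, 4, 0]])
P = pmi_matrix(X)
print(np.round(P, 3))
```

Under the generative model of Arora et al. (2016) that the paper examines, PMI(w, c) is predicted to be approximately proportional to the scalar product of the corresponding word vectors; the paper's experiments test in which co-occurrence regimes this relation actually holds.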