reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

A Statistical Approach for Optimal Topic Model Identification

Authors: Craig M. Lewis, Francesco Grossetti

JMLR 2022 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	The U.S. Presidential Inaugural Address Corpus is used as a case study to show the numerical results. We find that 92 topics best describe the corpus. We further validate the method through a simulation study confirming the superiority of our approach compared to other standard heuristic metrics like the perplexity index.
Researcher Affiliation	Academia	Craig M. Lewis, EMAIL, Owen Graduate School of Management, Vanderbilt University, Nashville, TN, USA; Francesco Grossetti, EMAIL, Department of Accounting and Bocconi Institute for Data Science and Business Analytics (BIDSA), Bocconi University, Milan, Italy
Pseudocode	No	The paper describes mathematical tests (Test 1, Test 2, Test 3, Test 4, Test 5) with equations and descriptive text, but it does not provide any structured pseudocode or algorithm blocks.
Open Source Code	Yes	The authors are currently developing the corresponding R package OpTop that will calculate all the tests introduced in this work. The package directly interacts with topicmodels and the related LDA_VEM class (Hornik and Grün, 2011) which provides the estimates for the LDA models. Footnote 11: The package can be found on Github at https://github.com/contefranz/OpTop. The development version is available for installation and testing.
Open Datasets	Yes	We test our algorithm on the U.S. presidential inaugural address texts (Peters, 2018). The corpus contains 58 documents of US president's inaugural addresses starting with George Washington's first inaugural address in 1789. ... Gerhard Peters. The American Presidency Project, 2018. URL https://www.presidency.ucsb.edu.
Dataset Splits	No	The paper uses the U.S. Presidential Inaugural Address Corpus as a case study and mentions generating synthetic corpora for a simulation study, but does not provide specific training/test/validation dataset splits for the real-world corpus.
Hardware Specification	No	The paper does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for running the experiments.
Software Dependencies	Yes	Text processing and management have been carried out with the R package quanteda (Benoit et al., 2018). LDA models are estimated with the R package topicmodels (Hornik and Grün, 2011) which exploits the original C code for the VEM fitting implemented by Blei et al. (2003). We use the open source R (R Core Team, 2021) programming language for data processing and visualizations. In particular, the former have been carried out with the data.table package (Dowle and Srinivasan, 2017) while the latter with ggplot2 (Wickham, 2009). Footnote 11: The package can be found on Github at https://github.com/contefranz/OpTop. The development version is available for installation and testing. ... The simulation study relies on the R package LDATS (Simonis et al., 2020).
Experiment Setup	No	The paper mentions estimating LDA models (e.g., from 2 to 200 topics) and using the VEM method for inference. However, the main text does not provide specific hyperparameters or system-level training settings such as learning rates, batch sizes, or optimizer configurations.