A Statistical Approach for Optimal Topic Model Identification
Authors: Craig M. Lewis, Francesco Grossetti
JMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The U.S. Presidential Inaugural Address Corpus is used as a case study to show the numerical results. We find that 92 topics best describe the corpus. We further validate the method through a simulation study confirming the superiority of our approach compared to other standard heuristic metrics like the perplexity index. |
| Researcher Affiliation | Academia | Craig M. Lewis (EMAIL), Owen Graduate School of Management, Vanderbilt University, Nashville, TN, USA; Francesco Grossetti (EMAIL), Department of Accounting and Bocconi Institute for Data Science and Business Analytics (BIDSA), Bocconi University, Milan, Italy |
| Pseudocode | No | The paper describes mathematical tests (Test 1, Test 2, Test 3, Test 4, Test 5) with equations and descriptive text, but it does not provide any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The authors are currently developing the corresponding R package OpTop that will calculate all the tests introduced in this work. The package directly interacts with topicmodels and the related LDA_VEM class (Hornik and Grün, 2011) which provides the estimates for the LDA models. The package can be found on Github at https://github.com/contefranz/OpTop. The development version is available for installation and testing. |
| Open Datasets | Yes | We test our algorithm on the U.S. presidential inaugural address texts (Peters, 2018). The corpus contains 58 documents of US president's inaugural addresses starting with George Washington's first inaugural address in 1789. ... Gerhard Peters. The American Presidency Project, 2018. URL https://www.presidency.ucsb.edu. |
| Dataset Splits | No | The paper uses the U.S. Presidential Inaugural Address Corpus as a case study and mentions generating synthetic corpora for a simulation study, but does not provide specific training/test/validation dataset splits for the real-world corpus. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for running the experiments. |
| Software Dependencies | Yes | Text processing and management have been carried out with the R package quanteda (Benoit et al., 2018). LDA models are estimated with the R package topicmodels (Hornik and Grün, 2011) which exploits the original C code for the VEM fitting implemented by Blei et al. (2003). We use the open source R (R Core Team, 2021) programming language for data processing and visualizations. In particular, the former have been carried out with the data.table package (Dowle and Srinivasan, 2017) while the latter with ggplot2 (Wickham, 2009). ... The package can be found on Github at https://github.com/contefranz/OpTop. The development version is available for installation and testing. ... The simulation study relies on the R package LDATS (Simonis et al., 2020). |
| Experiment Setup | No | The paper mentions estimating LDA models (e.g., from 2 to 200 topics) and using the VEM method for inference. However, the main text does not provide specific hyperparameters or system-level training settings such as learning rates, batch sizes, or optimizer configurations. |