Bayesian Text Classification and Summarization via a Class-Specified Topic Model
Authors: Feifei Wang, Junni L. Zhang, Yichao Li, Ke Deng, Jun S. Liu
JMLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We analyze in detail the 20 Newsgroups dataset, a benchmark dataset for text classification, and demonstrate that CSTM has better performance than a two-stage approach based on latent Dirichlet allocation (LDA), several existing supervised extensions of LDA, and an L1 penalized logistic regression. The favorable performance of CSTM is also demonstrated through Monte Carlo simulations and an analysis of the Reuters dataset. |
| Researcher Affiliation | Academia | Feifei Wang, Center for Applied Statistics, School of Statistics, Renmin University of China, Haidian District, Beijing 100872, China; Junni L. Zhang, National School of Development, Center for Statistical Science and Center for Data Science, Peking University, Haidian District, Beijing 100871, China; Yichao Li, Center for Statistical Science, Tsinghua University, Haidian District, Beijing 100084, China; Ke Deng, Center for Statistical Science, Tsinghua University, Haidian District, Beijing 100084, China; Jun S. Liu, Department of Statistics, Harvard University, Cambridge, MA 02138, USA |
| Pseudocode | No | The paper describes the generative processes for LDA and CSTM using figures and prose, and details the Gibbs Sampling algorithm in Appendix C through textual description and mathematical formulas rather than structured pseudocode blocks. |
| Open Source Code | No | The paper mentions third-party open-source tools such as the "Gibbs LDA++ algorithm implemented in C/C++" and "open-source codes on GitHub" for competing models, but it provides no statement or link for the authors' own CSTM implementation. |
| Open Datasets | Yes | The 20 Newsgroups dataset is a benchmark dataset for text classification... This dataset is downloadable from http://qwone.com/~jason/20Newsgroups/. Reuters-21578 is another benchmark dataset for text classification... http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html |
| Dataset Splits | Yes | The 20 Newsgroups dataset... has also been partitioned into a training set with 60% of the documents and a test set with 40% of the documents. Reuters-21578... The ModApte split uses 9,603 articles before April 7, 1987 as the training set, and uses 3,299 articles after this date as the test set. |
| Hardware Specification | Yes | On a Dell XPS laptop with a 2.8 GHz CPU and 8 GB RAM, the training time is 625 minutes, and the prediction time is 35 minutes. |
| Software Dependencies | No | The paper mentions using "the NLTK library in Python", "the Gibbs LDA++ algorithm implemented in C/C++ (Phan and Nguyen, 2007)", "the improved GLMNET algorithm (Yuan et al., 2012)" and "the scikit-learn package in Python". While the software names and sometimes publication years are mentioned, specific version numbers for these libraries (e.g., NLTK 3.x, scikit-learn 0.xx) are not provided. |
| Experiment Setup | Yes | For the hyperparameters, we set α = 0.5, and βv = 0.1 for v = 1, …, V, which are commonly used in LDA applications... We also set γ = 1... The tuning parameter for the Metropolis-Hastings step... is set at ξ = 0.1... Each chain is first run for B = 200 burn-in iterations... A set of G = 15 samples of Φ is then obtained by taking every 20th draw from the next 300 draws... The chain is then run for another 5000 iterations, with the first 1000 iterations discarded as burn-in iterations and the last 4000 iterations used for model inference. We set R1 = 5 and R2 = 5. |
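The sampling schedule reported in the Experiment Setup row can be made concrete with a short sketch. This is an illustrative reconstruction of the iteration bookkeeping and of a generic random-walk Metropolis-Hastings step with tuning parameter ξ = 0.1 — it is not the authors' CSTM code, and the function names and proposal form are assumptions:

```python
import math
import random

# Phase 1 (quoted schedule): B = 200 burn-in iterations, then G = 15
# samples of Phi obtained by taking every 20th draw from the next 300 draws.
B = 200
phase1_draws = 300
thin = 20
kept_phi = [B + t for t in range(1, phase1_draws + 1) if t % thin == 0]
# len(kept_phi) == 15, matching G = 15 in the paper.

# Phase 2 (quoted schedule): another 5000 iterations, with the first 1000
# discarded as burn-in and the last 4000 used for model inference.
phase2_total = 5000
phase2_burnin = 1000
inference_iters = phase2_total - phase2_burnin  # 4000 draws for inference


def mh_step(current, log_post, xi=0.1, rng=random):
    """One random-walk Metropolis-Hastings step (illustrative only).

    `xi` plays the role of the tuning parameter ξ = 0.1 quoted above;
    the paper's actual proposal distribution for its model parameters
    may differ from this Gaussian random walk.
    """
    proposal = current + rng.gauss(0.0, xi)
    log_accept_ratio = log_post(proposal) - log_post(current)
    if math.log(rng.random()) < log_accept_ratio:
        return proposal  # accept the proposed value
    return current       # reject: keep the current value
```

With this bookkeeping, the retained Φ samples fall at iterations 220, 240, ..., 500, and phase 2 contributes 4000 post-burn-in draws for inference.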