Bayesian Text Classification and Summarization via A Class-Specified Topic Model

Authors: Feifei Wang, Junni L. Zhang, Yichao Li, Ke Deng, Jun S. Liu

JMLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We analyze in detail the 20 Newsgroups dataset, a benchmark dataset for text classification, and demonstrate that CSTM has better performance than a two-stage approach based on latent Dirichlet allocation (LDA), several existing supervised extensions of LDA, and an L1 penalized logistic regression. The favorable performance of CSTM is also demonstrated through Monte Carlo simulations and an analysis of the Reuters dataset.
Researcher Affiliation | Academia | Feifei Wang (Center for Applied Statistics, School of Statistics, Renmin University of China, Haidian District, Beijing 100872, China); Junni L. Zhang (National School of Development, Center for Statistical Science and Center for Data Science, Peking University, Haidian District, Beijing 100871, China); Yichao Li (Center for Statistical Science, Tsinghua University, Haidian District, Beijing 100084, China); Ke Deng (Center for Statistical Science, Tsinghua University, Haidian District, Beijing 100084, China); Jun S. Liu (Department of Statistics, Harvard University, Cambridge, MA 02138, USA)
Pseudocode | No | The paper describes the generative processes for LDA and CSTM using figures and prose, and details the Gibbs sampling algorithm in Appendix C through textual description and mathematical formulas rather than structured pseudocode blocks.
Open Source Code | No | The paper mentions the use of third-party open-source tools such as the "Gibbs LDA++ algorithm implemented in C/C++" and "open-source codes on GitHub" for competing models, but provides no explicit statement or link for the authors' own CSTM implementation.
Open Datasets | Yes | The 20 Newsgroups dataset is a benchmark dataset for text classification... This dataset is downloadable from http://qwone.com/~jason/20Newsgroups/. Reuters-21578 is another benchmark dataset for text classification... http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
Dataset Splits | Yes | The 20 Newsgroups dataset... has also been partitioned into a training set with 60% of the documents and a test set with 40% of the documents. Reuters-21578... The ModApte split uses 9,603 articles before April 7, 1987 as the training set, and uses 3,299 articles after this date as the test set.
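The ModApte partition described above is a chronological split: articles dated before April 7, 1987 go to training, later articles to test. A minimal sketch of that rule, using hypothetical article records rather than the actual Reuters-21578 files:

```python
from datetime import date

# Hypothetical (date, text) records standing in for Reuters-21578 articles.
# The ModApte split assigns articles before April 7, 1987 to the training
# set and articles on or after that date to the test set.
articles = [
    (date(1987, 3, 1), "grain futures rise"),
    (date(1987, 4, 6), "oil prices steady"),
    (date(1987, 4, 8), "bank earnings up"),
    (date(1987, 5, 2), "trade deficit widens"),
]

cutoff = date(1987, 4, 7)
train = [text for d, text in articles if d < cutoff]
test = [text for d, text in articles if d >= cutoff]

print(len(train), len(test))  # 2 2
```

On the real corpus this rule yields the 9,603/3,299 counts quoted from the paper.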
Hardware Specification | Yes | On a Dell XPS laptop with a 2.8 GHz CPU and 8 GB of RAM, the training time is 625 minutes, and the prediction time is 35 minutes.
Software Dependencies | No | The paper mentions using "the NLTK library in Python", "the Gibbs LDA++ algorithm implemented in C/C++ (Phan and Nguyen, 2007)", "the improved GLMNET algorithm (Yuan et al., 2012)" and "the scikit-learn package in Python". While the software names and sometimes publication years are mentioned, specific version numbers for these libraries (e.g., NLTK 3.x, scikit-learn 0.xx) are not provided.
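The missing version numbers noted above are easy to record. A small stdlib-only sketch of one way to report installed versions of the named Python libraries (illustrative; the paper does not describe any such step):

```python
# Sketch: report installed versions of the Python libraries the paper names.
# Uses only the standard library (importlib.metadata, Python 3.8+).
from importlib.metadata import version, PackageNotFoundError


def report_version(pkg: str) -> str:
    """Return the installed version of pkg, or 'not installed'."""
    try:
        return version(pkg)
    except PackageNotFoundError:
        return "not installed"


for pkg in ("nltk", "scikit-learn"):
    print(pkg, report_version(pkg))
```

Pinning these versions (e.g., in a requirements file) would resolve the reproducibility gap flagged in this row.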
Experiment Setup | Yes | For the hyperparameters, we set α = 0.5, and βv = 0.1 for v = 1, ..., V, which are commonly used in LDA applications... We also set γ = 1... The tuning parameter for the Metropolis-Hastings step... is set at ξ = 0.1... Each chain is first run for B = 200 burn-in iterations... A set of G = 15 samples of Φ is then obtained by taking every 20th draw from the next 300 draws... The chain is then run for another 5000 iterations, with the first 1000 iterations discarded as burn-in iterations and the last 4000 iterations used for model inference. We set R1 = 5 and R2 = 5.
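The thinning schedule quoted above (B = 200 burn-in iterations, then every 20th of the next 300 draws retained, giving G = 15 samples of Φ) can be checked with a small sketch; iterations are represented by their indices, not by the authors' actual sampler state:

```python
# Sketch of the Gibbs thinning schedule described in the paper's setup:
# 200 burn-in iterations, then every 20th of the next 300 draws is kept,
# which yields exactly 15 retained samples of Phi.
B, post_draws, thin = 200, 300, 20

all_iters = list(range(1, B + post_draws + 1))  # iterations 1..500
kept = all_iters[B + thin - 1 :: thin]          # every 20th post-burn-in draw

print(len(kept))           # 15 retained samples (G = 15)
print(kept[0], kept[-1])   # 220 500
```

This confirms the arithmetic behind G = 15: 300 post-burn-in draws thinned by 20 leave 15 samples, taken at iterations 220, 240, ..., 500.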