Bayesian Text Classification and Summarization via A Class-Specified Topic Model

Authors: Feifei Wang, Junni L. Zhang, Yichao Li, Ke Deng, Jun S. Liu

JMLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We analyze in detail the 20 Newsgroups dataset, a benchmark dataset for text classification, and demonstrate that CSTM has better performance than a two-stage approach based on latent Dirichlet allocation (LDA), several existing supervised extensions of LDA, and an L1 penalized logistic regression. The favorable performance of CSTM is also demonstrated through Monte Carlo simulations and an analysis of the Reuters dataset.
Researcher Affiliation | Academia | Feifei Wang (Center for Applied Statistics, School of Statistics, Renmin University of China, Haidian District, Beijing 100872, China); Junni L. Zhang (National School of Development, Center for Statistical Science and Center for Data Science, Peking University, Haidian District, Beijing 100871, China); Yichao Li (Center for Statistical Science, Tsinghua University, Haidian District, Beijing 100084, China); Ke Deng (Center for Statistical Science, Tsinghua University, Haidian District, Beijing 100084, China); Jun S. Liu (Department of Statistics, Harvard University, Cambridge, MA 02138, USA)
Pseudocode | No | The paper describes the generative processes for LDA and CSTM using figures and prose, and details the Gibbs sampling algorithm in Appendix C through textual description and mathematical formulas rather than structured pseudocode blocks.
Open Source Code | No | The paper mentions the use of third-party open-source tools such as the "Gibbs LDA++ algorithm implemented in C/C++" and "open-source codes on GitHub" for competing models, but provides no explicit statement or link for the authors' own CSTM implementation.
Open Datasets | Yes | The 20 Newsgroups dataset is a benchmark dataset for text classification... This dataset is downloadable from http://qwone.com/~jason/20Newsgroups/. Reuters-21578 is another benchmark dataset for text classification... http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
Dataset Splits | Yes | The 20 Newsgroups dataset... has also been partitioned into a training set with 60% of the documents and a test set with 40% of the documents. Reuters-21578... The ModApte split uses 9,603 articles before April 7, 1987 as the training set, and uses 3,299 articles after this date as the test set.
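The ModApte partition described above is a chronological split: articles dated before April 7, 1987 go to training, later articles to test. A minimal sketch of that rule, using hypothetical article records rather than the actual Reuters-21578 files:

```python
from datetime import date

# Hypothetical (date, text) records standing in for Reuters-21578 articles.
# The ModApte split assigns articles before April 7, 1987 to the training
# set and articles on or after that date to the test set.
articles = [
    (date(1987, 3, 1), "grain futures rise"),
    (date(1987, 4, 6), "oil prices steady"),
    (date(1987, 4, 8), "bank earnings up"),
    (date(1987, 5, 2), "trade deficit widens"),
]

cutoff = date(1987, 4, 7)
train = [text for d, text in articles if d < cutoff]
test = [text for d, text in articles if d >= cutoff]

print(len(train), len(test))  # 2 2
```

On the real corpus this rule yields the 9,603/3,299 counts quoted from the paper.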
Hardware Specification | Yes | On a Dell XPS laptop with a 2.8 GHz CPU and 8 GB of RAM, the training time is 625 minutes, and the prediction time is 35 minutes.
Software Dependencies | No | The paper mentions using "the NLTK library in Python", "the Gibbs LDA++ algorithm implemented in C/C++ (Phan and Nguyen, 2007)", "the improved GLMNET algorithm (Yuan et al., 2012)" and "the scikit-learn package in Python". While the software names and sometimes publication years are mentioned, specific version numbers for these libraries (e.g., NLTK 3.x, scikit-learn 0.xx) are not provided.
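The missing version numbers noted above are easy to record. A small stdlib-only sketch of one way to report installed versions of the named Python libraries (illustrative; the paper does not describe any such step):

```python
# Sketch: report installed versions of the Python libraries the paper names.
# Uses only the standard library (importlib.metadata, Python 3.8+).
from importlib.metadata import version, PackageNotFoundError


def report_version(pkg: str) -> str:
    """Return the installed version of pkg, or 'not installed'."""
    try:
        return version(pkg)
    except PackageNotFoundError:
        return "not installed"


for pkg in ("nltk", "scikit-learn"):
    print(pkg, report_version(pkg))
```

Pinning these versions (e.g., in a requirements file) would resolve the reproducibility gap flagged in this row.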
Experiment Setup | Yes | For the hyperparameters, we set α = 0.5, and βv = 0.1 for v = 1, ..., V, which are commonly used in LDA applications... We also set γ = 1... The tuning parameter for the Metropolis-Hastings step... is set at ξ = 0.1... Each chain is first run for B = 200 burn-in iterations... A set of G = 15 samples of Φ is then obtained by taking every 20th draw from the next 300 draws... The chain is then run for another 5000 iterations, with the first 1000 iterations discarded as burn-in iterations and the last 4000 iterations used for model inference. We set R1 = 5 and R2 = 5.
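The thinning schedule quoted above (B = 200 burn-in iterations, then every 20th of the next 300 draws retained, giving G = 15 samples of Φ) can be checked with a small sketch; iterations are represented by their indices, not by the authors' actual sampler state:

```python
# Sketch of the Gibbs thinning schedule described in the paper's setup:
# 200 burn-in iterations, then every 20th of the next 300 draws is kept,
# which yields exactly 15 retained samples of Phi.
B, post_draws, thin = 200, 300, 20

all_iters = list(range(1, B + post_draws + 1))  # iterations 1..500
kept = all_iters[B + thin - 1 :: thin]          # every 20th post-burn-in draw

print(len(kept))           # 15 retained samples (G = 15)
print(kept[0], kept[-1])   # 220 500
```

This confirms the arithmetic behind G = 15: 300 post-burn-in draws thinned by 20 leave 15 samples, taken at iterations 220, 240, ..., 500.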