Optimal Estimation of Sparse Topic Models
Authors: Xin Bing, Florentina Bunea, Marten Wegkamp
JMLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results on both synthetic and semi-synthetic data show that our proposed estimator is a strong competitor of the existing state-of-the-art algorithms for both non-sparse A and sparse A, and has superior performance in many scenarios of interest. |
| Researcher Affiliation | Academia | Xin Bing and Florentina Bunea, Department of Statistics and Data Science, Cornell University, Ithaca, NY 14850, USA; Marten Wegkamp, Department of Mathematics and Department of Statistics and Data Science, Cornell University, Ithaca, NY 14850, USA |
| Pseudocode | Yes | Algorithm 1: Sparse Topic Model solver (STM) |
| Open Source Code | No | The paper does not explicitly state that the source code for their proposed STM method is made available, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We evaluate two real-world data sets, a corpus of NIPS articles and a corpus of New York Times (NYT) articles (Dheeru and Karra Taniskidou, 2017). |
| Dataset Splits | No | The paper uses synthetic and semi-synthetic data generation to evaluate performance but does not specify explicit train/test/validation splits for model evaluation in the traditional sense. It describes parameters for data generation (e.g., N, p, n, K, sparsity proportion η, Dirichlet parameters) and then evaluates estimation errors on these generated datasets, rather than using predefined splits of a static dataset. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments. It only reports running times. |
| Software Dependencies | No | The paper mentions using "the code of LDA from Riddell et al. (2016)" but does not provide specific version numbers for LDA or any other libraries used in their own implementation. |
| Experiment Setup | Yes | In Section 6.1 (Synthetic Data), the paper specifies parameters like "N = 1500, p = n = 1000, K = 20, \|Ik\| = p/200 and ξ = K/p." It also states, "For each η ∈ {0, 0.1, 0.2, ..., 0.9}". In Section 6.2 (Semi-synthetic Data), it mentions "We set N = 850 and vary n ∈ {2000, 4000, 6000, 8000, 10000}", and "The columns of W are generated from the symmetric Dirichlet distribution with parameter 0.03." For their proposed Algorithm 1 (STM), they mention choosing "λ according to (29) and we select the anchor words either via AWR with specified K or via TOP (Bing et al., 2020)", and "our empirical study suggests the choice c0 = 0.01" for λ selection. |
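The generative setup quoted above (documents drawn from a topic model X ≈ AW, with columns of W drawn from a symmetric Dirichlet) can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the dimensions are scaled down from the paper's settings, and all variable names and the dense Dirichlet prior on A are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Scaled-down stand-ins for the paper's N=1500, p=n=1000, K=20 (assumption)
p, n, K, N = 200, 100, 5, 1500  # vocab size, documents, topics, words per document

# Word-topic matrix A (p x K): columns on the probability simplex.
# A flat Dirichlet prior here is an assumption; the paper constructs A
# with anchor words and controlled sparsity.
A = rng.dirichlet(np.ones(p), size=K).T

# Topic-document matrix W (K x n): columns drawn from the symmetric
# Dirichlet with parameter 0.03, as quoted from Section 6.2.
W = rng.dirichlet(0.03 * np.ones(K), size=n).T

# Expected word frequencies per document; columns of Pi sum to 1.
Pi = A @ W

# Observed document-term frequencies: N multinomial draws per document.
X = np.empty((p, n))
for j in range(n):
    pj = Pi[:, j] / Pi[:, j].sum()  # guard against floating-point drift
    X[:, j] = rng.multinomial(N, pj) / N
```

The estimation error of a topic-model estimator would then be measured between the recovered and true A, as in the paper's synthetic-data study.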