Optimal Estimation of Sparse Topic Models
Authors: Xin Bing, Florentina Bunea, Marten Wegkamp
JMLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results on both synthetic and semi-synthetic data show that our proposed estimator is a strong competitor of the existing state-of-the-art algorithms for both non-sparse A and sparse A, and has superior performance in many scenarios of interest. |
| Researcher Affiliation | Academia | Xin Bing and Florentina Bunea, Department of Statistics and Data Science, Cornell University, Ithaca, NY 14850, USA; Marten Wegkamp, Department of Mathematics and Department of Statistics and Data Science, Cornell University, Ithaca, NY 14850, USA |
| Pseudocode | Yes | Algorithm 1: Sparse Topic Model solver (STM) |
| Open Source Code | No | The paper does not explicitly state that the source code for their proposed STM method is made available, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We evaluate two real-world data sets, a corpus of NIPS articles and a corpus of New York Times (NYT) articles (Dheeru and Karra Taniskidou, 2017). |
| Dataset Splits | No | The paper uses synthetic and semi-synthetic data generation to evaluate performance but does not specify explicit train/test/validation splits for model evaluation in the traditional sense. It describes parameters for data generation (e.g., N, p, n, K, sparsity proportion η, Dirichlet parameters) and then evaluates estimation errors on these generated datasets, rather than using predefined splits of a static dataset. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments. It only reports running times. |
| Software Dependencies | No | The paper mentions using "the code of LDA from Riddell et al. (2016)" but does not provide specific version numbers for LDA or any other libraries used in their own implementation. |
| Experiment Setup | Yes | In Section 6.1 (Synthetic Data), the paper specifies parameters like "N = 1500, p = n = 1000, K = 20, \|Ik\| = p/200 and ξ = K/p." It also states, "For each η ∈ {0, 0.1, 0.2, ..., 0.9}". In Section 6.2 (Semi-synthetic Data), it mentions "We set N = 850 and vary n ∈ {2000, 4000, 6000, 8000, 10000}", and "The columns of W are generated from the symmetric Dirichlet distribution with parameter 0.03." For their proposed Algorithm 1 (STM), they mention choosing "λ according to (29) and we select the anchor words either via AWR with specified K or via TOP (Bing et al., 2020)", and "our empirical study suggests the choice c0 = 0.01" for λ selection. |
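The generative setup quoted above (documents drawn from a topic model X ≈ AW, with columns of W drawn from a symmetric Dirichlet) can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the dimensions are scaled down from the paper's settings, and all variable names and the dense Dirichlet prior on A are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Scaled-down stand-ins for the paper's N=1500, p=n=1000, K=20 (assumption)
p, n, K, N = 200, 100, 5, 1500  # vocab size, documents, topics, words per document

# Word-topic matrix A (p x K): columns on the probability simplex.
# A flat Dirichlet prior here is an assumption; the paper constructs A
# with anchor words and controlled sparsity.
A = rng.dirichlet(np.ones(p), size=K).T

# Topic-document matrix W (K x n): columns drawn from the symmetric
# Dirichlet with parameter 0.03, as quoted from Section 6.2.
W = rng.dirichlet(0.03 * np.ones(K), size=n).T

# Expected word frequencies per document; columns of Pi sum to 1.
Pi = A @ W

# Observed document-term frequencies: N multinomial draws per document.
X = np.empty((p, n))
for j in range(n):
    pj = Pi[:, j] / Pi[:, j].sum()  # guard against floating-point drift
    X[:, j] = rng.multinomial(N, pj) / N
```

The estimation error of a topic-model estimator would then be measured between the recovered and true A, as in the paper's synthetic-data study.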