Gibbs Max-margin Topic Models with Data Augmentation

Authors: Jun Zhu, Ning Chen, Hugh Perkins, Bo Zhang

JMLR 2014

Reproducibility assessment — each variable is listed with its assessed result and the supporting LLM response (quoted passages are excerpts from the paper):
Research Type: Experimental
"Experimental results on several medium-sized and large-scale data sets demonstrate significant improvements on time efficiency. The classification performance is also improved over competitors on binary, multi-class and multi-label classification tasks."
Researcher Affiliation: Academia
Jun Zhu, Ning Chen, Hugh Perkins, Bo Zhang (email addresses omitted) — State Key Lab of Intelligent Technology and Systems, Tsinghua National Lab for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China.
Pseudocode: Yes
"Algorithm 1: collapsed Gibbs sampling algorithm for GibbsMedLDA classification models."
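The paper's Algorithm 1 augments a collapsed Gibbs sampler with max-margin data-augmentation terms. As a point of reference only, here is a sketch of the plain collapsed Gibbs sampler for vanilla LDA that such an algorithm extends; all function and variable names are illustrative and this omits the max-margin terms entirely:

```python
import numpy as np

def collapsed_gibbs_lda(docs, V, K, alpha=1.0, beta=0.01, iters=50, seed=0):
    """Plain collapsed Gibbs sampler for LDA (no max-margin terms).

    docs: list of word-id lists; V: vocabulary size; K: number of topics.
    Returns the per-document topic counts (the topic representations
    that a downstream max-margin classifier would consume).
    """
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), K))   # doc-topic counts
    nkw = np.zeros((K, V))           # topic-word counts
    nk = np.zeros(K)                 # per-topic totals
    # Random initialization of topic assignments z.
    z = [rng.integers(K, size=len(d)) for d in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the current assignment from the counts.
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # p(z = k | rest) ∝ (ndk + α) * (nkw + β) / (nk + Vβ)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return ndk
```

The counts `ndk` play the role of the document's topic representation; the paper's sampler differs by folding the classifier's augmented likelihood into the conditional.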
Open Source Code: Yes
"Finally, we release the code for public use." The code is available at: http://www.ml-thu.net/~jun/gibbs-medlda.shtml
Open Datasets: Yes
"We present empirical results to demonstrate the efficiency and prediction performance of GibbsMedLDA on the 20Newsgroups data set for classification, a hotel review data set for regression, and a Wikipedia data set with more than 1 million documents for multi-label classification. ... The Wiki data set is built from the large Wikipedia set used in the PASCAL LSHC challenge 2012, where each document has multiple labels. The original data set is extremely imbalanced." The data set is available at: http://lshtc.iit.demokritos.gr/
Dataset Splits: Yes
"The training set contains 856 documents, and the test set contains 569 documents. ... The data set is uniformly partitioned into training and testing sets. ... The test set consists of 7,505 documents, in which the smallest category has 251 documents and the largest category has 399 documents. The training set consists of 11,269 documents, in which the smallest and the largest categories contain 376 and 599 documents, respectively. ... The training set consists of 1.1 million documents and the testing set consists of 5,000 documents."
Hardware Specification: No
"All the experiments, except those on the large Wikipedia data set, are done on a standard desktop computer. ... Figure 5 shows the precision, recall and F1 measure (i.e., the harmonic mean of precision and recall) of various models running on a distributed cluster with 20 nodes (each node is equipped with two 6-core CPUs)."
Explanation: The hardware for the main experiments is described only as a "standard desktop computer," which is not specific enough. For the Wikipedia data set, a "distributed cluster with 20 nodes (each node is equipped with two 6-core CPUs)" is mentioned, which gives core counts but not specific CPU models, clock speeds, or memory.
Software Dependencies: No
"For GibbsLDA, we learn a binary linear SVM on its topic representations using SVMLight (Joachims, 1999). ... we use the parallel implementation in Yahoo-LDA, which is publicly available at: https://github.com/shravanmn/Yahoo_LDA."
Explanation: The paper names specific tools (SVMLight, Yahoo-LDA) but provides no version numbers for these or for any other ancillary software dependencies that would be needed for replication.
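The quoted pipeline step — training a binary linear SVM on per-document topic representations — can be sketched without SVMLight. The following is a minimal Pegasos-style subgradient sketch in NumPy as a stand-in for that step; it is an illustrative substitute, not the released implementation, and the function name is invented:

```python
import numpy as np

def train_linear_svm(X, y, lam=1e-2, epochs=200, seed=0):
    """Minimal Pegasos-style linear SVM (illustrative stand-in for SVMLight).

    X: (n, K) per-document topic proportions; y: labels in {-1, +1}.
    Minimizes lam/2 * ||w||^2 plus the average hinge loss, no bias term.
    """
    rng = np.random.default_rng(seed)
    n, k = X.shape
    w = np.zeros(k)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)        # Pegasos step-size schedule
            w *= (1.0 - eta * lam)       # shrink from the L2 regularizer
            if y[i] * (X[i] @ w) < 1:    # hinge subgradient is active
                w += eta * y[i] * X[i]
    return w
```

Predictions are then `sign(X_test @ w)`; SVMLight solves the same hinge-loss objective (with a bias term and a dual solver) far more carefully.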
Experiment Setup: Yes
"For GibbsMedLDA, we set α = 1, ℓ = 164 and M = 10. ... we fix c = 1 for simplicity. ... For GibbsMedLDA and vMedLDA, the precision is set at ϵ = 1e-3 and c is selected via 5-fold cross-validation during training. Again, we set the Dirichlet parameter α = 1 and the number of burn-in M = 10. ... Again, since GibbsMedLDA is insensitive to α and ℓ, we set α = 1 and ℓ = 64. We also fix c = 1 for simplicity. The number of burn-in iterations is set as M = 20. ... For multi-task GibbsMedLDA, we use 40 burn-in steps, which is sufficiently large."
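Collected in one place, the reported settings might be expressed as the following hypothetical configuration. The field names are invented, the grouping of settings by task is inferred from the quoted text rather than stated explicitly, and none of this comes from the released code:

```python
# Hypothetical grouping of the hyperparameters quoted from the paper;
# the task-to-setting mapping is an inference, not an authoritative record.
SETTINGS = {
    "binary":     {"alpha": 1.0, "ell": 164, "burn_in_M": 10, "c": 1.0},
    "regression": {"alpha": 1.0, "burn_in_M": 10, "epsilon": 1e-3,
                   "c": "5-fold cross-validation"},
    "multiclass": {"alpha": 1.0, "ell": 64, "burn_in_M": 20, "c": 1.0},
    "multi_task": {"burn_in_M": 40},
}
```

A table like this makes it easy to check that every quoted number (α, ℓ, c, ϵ, and the burn-in M) is accounted for when attempting a replication.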