Gibbs Max-margin Topic Models with Data Augmentation
Authors: Jun Zhu, Ning Chen, Hugh Perkins, Bo Zhang
JMLR 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on several medium-sized and large-scale data sets demonstrate significant improvements on time efficiency. The classification performance is also improved over competitors on binary, multi-class and multi-label classification tasks. |
| Researcher Affiliation | Academia | Jun Zhu EMAIL Ning Chen EMAIL Hugh Perkins EMAIL Bo Zhang EMAIL State Key Lab of Intelligent Technology and Systems Tsinghua National Lab for Information Science and Technology Department of Computer Science and Technology Tsinghua University Beijing, 100084, China |
| Pseudocode | Yes | Algorithm 1: collapsed Gibbs sampling algorithm for GibbsMedLDA classification models |
| Open Source Code | Yes | Finally, we release the code for public use. The code is available at: http://www.ml-thu.net/~jun/gibbs-medlda.shtml. |
| Open Datasets | Yes | We present empirical results to demonstrate the efficiency and prediction performance of Gibbs MedLDA (denoted by GibbsMedLDA) on the 20Newsgroups data set for classification, a hotel review data set for regression, and a Wikipedia data set with more than 1 million documents for multi-label classification. ...The Wiki data set is built from the large Wikipedia set used in the PASCAL LSHC challenge 2012; each document has multiple labels. The original data set is extremely imbalanced. The data set is available at: http://lshtc.iit.demokritos.gr/. |
| Dataset Splits | Yes | The training set contains 856 documents, and the test set contains 569 documents. ...The data set is uniformly partitioned into training and testing sets. ...The test set consists of 7,505 documents, in which the smallest category has 251 documents and the largest category has 399 documents. The training set consists of 11,269 documents, in which the smallest and the largest categories contain 376 and 599 documents, respectively. ...The training set consists of 1.1 million documents and the testing set consists of 5,000 documents. |
| Hardware Specification | No | All the experiments, except those on the large Wikipedia data set, are done on a standard desktop computer. ...Figure 5 shows the precision, recall and F1 measure (i.e., the harmonic mean of precision and recall) of various models running on a distributed cluster with 20 nodes (each node is equipped with two 6-core CPUs). Explanation: The hardware for the main experiments is described only as a "standard desktop computer," which is not specific enough. For the Wikipedia dataset, a "distributed cluster with 20 nodes (each node is equipped with two 6-core CPUs)" is mentioned, providing core counts but not specific CPU *models* or other detailed specifications such as memory. |
| Software Dependencies | No | For GibbsLDA, we learn a binary linear SVM on its topic representations using SVMLight (Joachims, 1999). For GibbsLDA, we use the parallel implementation in Yahoo-LDA, which is publicly available at: https://github.com/shravanmn/Yahoo_LDA. Explanation: The paper mentions specific tools like SVMLight and Yahoo-LDA, but does not provide version numbers for these or any other ancillary software dependencies that are crucial for replication. |
| Experiment Setup | Yes | For GibbsMedLDA, we set α = 1, ℓ = 164 and M = 10. ...we fix c = 1 for simplicity. ...For GibbsMedLDA and vMedLDA, the precision is set at ϵ = 1e-3 and c is selected via 5-fold cross-validation during training. Again, we set the Dirichlet parameter α = 1 and the number of burn-in steps M = 10. ...Again, since GibbsMedLDA is insensitive to α and ℓ, we set α = 1 and ℓ = 64. We also fix c = 1 for simplicity. The number of burn-in iterations is set as M = 20. ...For multi-task GibbsMedLDA, we use 40 burn-in steps, which is sufficiently large. |
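The Pseudocode row refers to Algorithm 1, a collapsed Gibbs sampling algorithm for GibbsMedLDA. For orientation only, the sketch below implements the plain collapsed Gibbs sampler for the LDA topic component; it is *not* the paper's Algorithm 1, which additionally samples classifier weights and Polya-Gamma augmentation variables. All function and variable names here are illustrative, not taken from the released code.

```python
import numpy as np

def collapsed_gibbs_lda(docs, V, K, alpha=1.0, beta=0.01, iters=50, seed=0):
    """Minimal collapsed Gibbs sampler for vanilla LDA.

    docs : list of documents, each a list of word ids in [0, V)
    V, K : vocabulary size and number of topics
    Returns (theta, z): per-document topic proportions and topic assignments.
    """
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), K))   # doc-topic counts
    nkw = np.zeros((K, V))           # topic-word counts
    nk = np.zeros(K)                 # per-topic totals
    # random initialization of topic assignments
    z = [rng.integers(K, size=len(d)) for d in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # remove current assignment from the counts
                k = z[d][i]
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # collapsed conditional p(z_i = k | rest), up to a constant
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    # posterior-mean topic proportions per document
    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
    return theta, z
```

In GibbsMedLDA, the conditional for each z would carry an extra factor from the max-margin likelihood (via the augmentation variables); this sketch shows only the shared LDA skeleton that those classifier-specific steps plug into.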