Augmentable Gamma Belief Networks
Authors: Mingyuan Zhou, Yulai Cong, Bo Chen
JMLR 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | With extensive experiments in text and image analysis, we demonstrate that the deep GBN with two or more hidden layers clearly outperforms the shallow GBN with a single hidden layer in both unsupervisedly extracting latent features for classification and predicting heldout data. |
| Researcher Affiliation | Academia | Mingyuan Zhou (EMAIL), Department of Information, Risk, and Operations Management, McCombs School of Business, The University of Texas at Austin, Austin, TX 78712, USA; Yulai Cong (EMAIL) and Bo Chen (EMAIL), National Laboratory of Radar Signal Processing, Collaborative Innovation Center of Information Sensing and Understanding, Xidian University, Xi'an, Shaanxi 710071, China |
| Pseudocode | Yes | Algorithm 1 The PGBN upward-downward Gibbs sampler that uses a layer-wise training strategy to train a set of networks, each of which adds an additional hidden layer on top of the previously inferred network, retrains all its layers jointly, and prunes inactive factors from the last layer. Algorithm 2 The upward-downward Gibbs samplers for the Ber-GBN and PRG-GBN are constructed by using Lines 1-8 shown below to substitute Lines 4-11 of the PGBN Gibbs sampler shown in Algorithm 1. |
| Open Source Code | Yes | Matlab code will be available in http://mingyuanzhou.github.io/. |
| Open Datasets | Yes | We consider the 20newsgroups data set that consists of 18,774 documents from 20 different news groups, with a vocabulary of size K0 = 61,188. It is partitioned into a training set of 11,269 documents and a testing set of 7,505 ones. (http://qwone.com/~jason/20Newsgroups/) We consider both all the 18,774 documents of the 20newsgroups corpus, limiting the vocabulary to the 2,000 most frequent terms after removing a standard list of stopwords, and the NIPS12 corpus (http://www.cs.nyu.edu/~roweis/data.html), whose stopwords have already been removed, limiting the vocabulary to the 2,000 most frequent terms. We consider the MNIST data set (http://yann.lecun.com/exdb/mnist/), which consists of 60,000 training handwritten digits and 10,000 testing ones. |
| Dataset Splits | Yes | It is partitioned into a training set of 11,269 documents and a testing set of 7,505 ones. We randomly choose 30% of the word tokens in each document as training, and use the remaining ones to calculate per-heldout-word perplexity. |
| Hardware Specification | Yes | Each iteration of jointly training multiple layers usually only costs moderately more than that of training a single layer, e.g., with K1 max = 400, a training iteration on a single core of an Intel Xeon 2.7 GHz CPU takes about 5.6, 6.7, 7.1 seconds for the PGBN with 1, 3, and 5 layers, respectively. |
| Software Dependencies | No | We use the L2-regularized logistic regression provided by the LIBLINEAR package (Fan et al., 2008) to train a linear classifier on θ_j in the training set and use it to classify θ_j in the test set, where the regularization parameter is five-fold cross-validated on the training set from {2^-10, 2^-9, ..., 2^15}. |
| Experiment Setup | Yes | We set the hyper-parameters as a0 = b0 = 0.01 and e0 = f0 = 1. Given the trained network, we apply the upward-downward Gibbs sampler to collect 500 MCMC samples after 500 burn-ins to estimate the posterior mean of the feature usage proportion vector θ_j^(1) / θ_·j^(1) at the first hidden layer, for every document in both the training and testing sets. With the upper bound of the first layer's width set as K1max ∈ {25, 50, 100, 200, 400, 600, 800}, and Bt = Ct = 1000 and η^(t) = 0.01 for all t, we use Algorithm 1 to train a network with T ∈ {1, 2, ..., 8} layers. We set Ct = 500 and η^(t) = 0.05 for all t; we set Bt = 1000 for all t if K1max ≤ 400, and set B1 = 1000 and Bt = 500 for t ≥ 2 if K1max > 400. |
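
The Dataset Splits row describes holding out word tokens (not word types): 30% of the tokens in each document are used for training and the rest for evaluation. A minimal NumPy sketch of such a split is below; the function name `split_tokens` and the toy count vector are illustrative assumptions, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def split_tokens(counts, train_frac=0.3, rng=rng):
    """Split a bag-of-words count vector into train/heldout halves by
    randomly assigning each individual word token (not each word type)."""
    # Expand counts into an explicit token list: term index repeated count times.
    tokens = np.repeat(np.arange(len(counts)), counts)
    mask = rng.random(len(tokens)) < train_frac
    train = np.bincount(tokens[mask], minlength=len(counts))
    heldout = np.bincount(tokens[~mask], minlength=len(counts))
    return train, heldout

doc = np.array([4, 0, 2, 1])  # toy counts over a 4-term vocabulary
train, heldout = split_tokens(doc)
# Every token lands in exactly one of the two halves.
assert np.array_equal(train + heldout, doc)
```

Splitting at the token level means a frequent term can contribute counts to both halves of the same document, which is what per-heldout-word evaluation requires.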
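
Given a model's predicted rates over the vocabulary, per-heldout-word perplexity is the exponentiated negative mean log-likelihood of the heldout tokens. A small sketch under that standard definition (the toy `rates` and `heldout` vectors are assumptions for illustration; the paper computes the rates from its inferred network):

```python
import numpy as np

def heldout_perplexity(rates, heldout):
    """Per-heldout-word perplexity: exp of the negative average
    log-probability assigned to the heldout word tokens."""
    probs = rates / rates.sum()       # normalize rates to a distribution over terms
    n = heldout.sum()                 # number of heldout tokens
    return np.exp(-(heldout * np.log(probs)).sum() / n)

rates = np.array([4.0, 1.0, 3.0, 2.0])   # toy predicted rates per term
heldout = np.array([2, 0, 1, 1])         # toy heldout token counts
ppl = heldout_perplexity(rates, heldout)
```

Lower perplexity means the predicted distribution places more mass on the words actually held out; a perfect one-hot match would give perplexity 1.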
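
The Software Dependencies row describes five-fold cross-validating the L2 regularization parameter over {2^-10, ..., 2^15} with LIBLINEAR. The sketch below substitutes a plain gradient-descent logistic regression for the LIBLINEAR solver and uses synthetic features in place of the θ_j vectors; a coarser C grid keeps it fast. All names and data here are illustrative assumptions.

```python
import numpy as np

def fit_logreg(X, y, C, iters=200, lr=0.1):
    """L2-regularized logistic regression via gradient descent
    (a stand-in for the LIBLINEAR solver used in the paper)."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        grad = X.T @ (p - y) + w / C      # data gradient + L2 penalty
        w -= lr * grad / len(y)
    return w

def accuracy(w, X, y):
    return np.mean(((X @ w) > 0) == y)

def cv_select_C(X, y, grid, folds=5):
    """Pick C by five-fold cross-validated accuracy on the training set."""
    idx = np.arange(len(y))
    scores = {}
    for C in grid:
        fold_accs = []
        for f in range(folds):
            val = idx % folds == f        # deterministic fold assignment
            w = fit_logreg(X[~val], y[~val], C)
            fold_accs.append(accuracy(w, X[val], y[val]))
        scores[C] = np.mean(fold_accs)
    return max(scores, key=scores.get)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                       # toy stand-in for theta_j features
y = (X[:, 0] + 0.3 * rng.normal(size=100) > 0).astype(float)
C_grid = [2.0**k for k in range(-10, 16, 5)]        # coarse version of {2^-10,...,2^15}
best_C = cv_select_C(X, y, C_grid)
```

The selected C is then used to retrain on the full training set before classifying the test-set θ_j, mirroring the pipeline the row describes.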