Augmentable Gamma Belief Networks

Authors: Mingyuan Zhou, Yulai Cong, Bo Chen

JMLR 2016

Reproducibility Variable Result LLM Response
Research Type Experimental With extensive experiments in text and image analysis, we demonstrate that the deep GBN with two or more hidden layers clearly outperforms the shallow GBN with a single hidden layer in both unsupervisedly extracting latent features for classification and predicting heldout data.
Researcher Affiliation Academia Mingyuan Zhou (EMAIL), Department of Information, Risk, and Operations Management, McCombs School of Business, The University of Texas at Austin, Austin, TX 78712, USA; Yulai Cong (EMAIL) and Bo Chen (EMAIL), National Laboratory of Radar Signal Processing, Collaborative Innovation Center of Information Sensing and Understanding, Xidian University, Xi'an, Shaanxi 710071, China
Pseudocode Yes Algorithm 1: The PGBN upward-downward Gibbs sampler, which uses a layer-wise training strategy to train a set of networks, each of which adds an additional hidden layer on top of the previously inferred network, retrains all its layers jointly, and prunes inactive factors from the last layer. Algorithm 2: The upward-downward Gibbs samplers for the Ber-GBN and PRG-GBN are constructed by using Lines 1-8 shown below to substitute Lines 4-11 of the PGBN Gibbs sampler shown in Algorithm 1.
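The greedy layer-wise strategy of Algorithm 1 can be sketched as follows. This is a minimal structural illustration, not the authors' MATLAB implementation: `gibbs_sweep` is a placeholder that merely reports which top-layer factors stay active, whereas the real sampler performs a full joint upward-downward update of all layers.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(widths):
    """Placeholder for one joint upward-downward Gibbs sweep.
    Returns a boolean mask over the top layer's factors marking those that
    received any counts (active factors); a real sampler would also resample
    the factor loadings and hidden units of every layer."""
    return rng.random(widths[-1]) > 0.1  # pretend ~90% of top factors stay active

def train_layerwise(K1_max, T, B, C):
    """Grow a T-layer network one layer at a time.
    K1_max : upper bound on the first hidden layer's width
    B[t-1], C[t-1] : burn-in and collection sweeps for the t-layer network"""
    widths = []
    for t in range(1, T + 1):
        # add a new top layer, its width bounded by the layer below it
        widths.append(K1_max if t == 1 else widths[-1])
        # retrain all layers jointly
        for _ in range(B[t - 1] + C[t - 1]):
            active = gibbs_sweep(widths)
        # prune inactive factors from the newly added top layer
        widths[-1] = int(active.sum())
    return widths

widths = train_layerwise(K1_max=400, T=3, B=[5, 5, 5], C=[5, 5, 5])
print(widths)  # inferred layer widths, non-increasing from bottom to top
```

The pruning step is what lets the higher layers shrink automatically, so only the first layer's width needs an explicit upper bound.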
Open Source Code Yes Matlab code will be available in http://mingyuanzhou.github.io/.
Open Datasets Yes We consider the 20newsgroups data set that consists of 18,774 documents from 20 different news groups, with a vocabulary of size K0 = 61,188. It is partitioned into a training set of 11,269 documents and a testing set of 7,505 ones. (http://qwone.com/~jason/20Newsgroups/) We consider both all the 18,774 documents of the 20newsgroups corpus, limiting the vocabulary to the 2000 most frequent terms after removing a standard list of stopwords, and the NIPS12 (http://www.cs.nyu.edu/~roweis/data.html) corpus whose stopwords have already been removed, limiting the vocabulary to the 2000 most frequent terms. We consider the MNIST data set (http://yann.lecun.com/exdb/mnist/), which consists of 60,000 training handwritten digits and 10,000 testing ones.
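The vocabulary restriction described above (remove stopwords, keep the most frequent terms) can be sketched in a few lines. The stop-word list and toy corpus below are stand-ins; the paper keeps the top 2000 terms, while this sketch keeps 4.

```python
from collections import Counter

STOPWORDS = {"the", "a", "from", "of", "and"}  # tiny stand-in stop list

def build_vocabulary(docs, vocab_size):
    """Keep the `vocab_size` most frequent terms after removing stopwords,
    mirroring the preprocessing described above (the paper uses 2000)."""
    counts = Counter(
        tok for doc in docs for tok in doc.lower().split() if tok not in STOPWORDS
    )
    return [term for term, _ in counts.most_common(vocab_size)]

docs = [
    "the deep network extracts latent features from text",
    "a deep belief network models count data",
    "text features help classify the documents",
]
vocab = build_vocabulary(docs, 4)
print(vocab)  # the 4 most frequent non-stopword terms
```

Documents are then re-encoded as count vectors over this reduced vocabulary before training.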
Dataset Splits Yes It is partitioned into a training set of 11,269 documents and a testing set of 7,505 ones. We randomly choose 30% of the word tokens in each document as training, and use the remaining ones to calculate per-heldout-word perplexity.
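The 30% token-level split can be sketched as follows. Note the split is over individual word tokens within each document, not over documents or vocabulary terms; the helper below is a hypothetical illustration of that scheme.

```python
import numpy as np

def split_tokens(doc_counts, train_frac=0.3, seed=0):
    """Randomly assign `train_frac` of a document's word tokens to training
    and the rest to heldout, at the token (not term) level.
    doc_counts: 1-D array of word counts over the vocabulary."""
    rng = np.random.default_rng(seed)
    tokens = np.repeat(np.arange(len(doc_counts)), doc_counts)  # expand counts to tokens
    rng.shuffle(tokens)
    n_train = int(round(train_frac * len(tokens)))
    train = np.bincount(tokens[:n_train], minlength=len(doc_counts))
    heldout = np.bincount(tokens[n_train:], minlength=len(doc_counts))
    return train, heldout

counts = np.array([4, 0, 3, 3])       # 10 word tokens over a 4-term vocabulary
train, heldout = split_tokens(counts)
print(train + heldout)                # recovers the original counts
print(train.sum(), heldout.sum())     # 3 and 7 tokens (30% / 70%)
```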
Hardware Specification Yes Each iteration of jointly training multiple layers usually only costs moderately more than that of training a single layer, e.g., with K1 max = 400, a training iteration on a single core of an Intel Xeon 2.7 GHz CPU takes about 5.6, 6.7, 7.1 seconds for the PGBN with 1, 3, and 5 layers, respectively.
Software Dependencies No We use the L2 regularized logistic regression provided by the LIBLINEAR package (Fan et al., 2008) to train a linear classifier on θj in the training set and use it to classify θj in the test set, where the regularization parameter is five-fold cross-validated on the training set from {2^-10, 2^-9, . . . , 2^15}.
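The classification protocol quoted above can be sketched with scikit-learn's `liblinear` solver standing in for the LIBLINEAR package itself; the synthetic features below are placeholders for the inferred θj vectors.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

# Regularization grid from the report: C in {2^-10, 2^-9, ..., 2^15},
# selected by five-fold cross-validation on the training set.
Cs = 2.0 ** np.arange(-10, 16)  # 26 candidate values

# Tiny synthetic stand-in for the theta_j feature vectors.
rng = np.random.default_rng(0)
X = rng.random((40, 5))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

clf = LogisticRegressionCV(Cs=Cs, cv=5, penalty="l2", solver="liblinear")
clf.fit(X, y)
print(clf.C_)  # the cross-validated choice of C, drawn from the grid
```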
Experiment Setup Yes We set the hyper-parameters as a0 = b0 = 0.01 and e0 = f0 = 1. Given the trained network, we apply the upward-downward Gibbs sampler to collect 500 MCMC samples after 500 burn-ins to estimate the posterior mean of the feature-usage proportion vector at the first hidden layer (θj(1) normalized by its sum), for every document in both the training and testing sets. With the upper bound of the first layer's width set as K1max ∈ {25, 50, 100, 200, 400, 600, 800}, and Bt = Ct = 1000 and η(t) = 0.01 for all t, we use Algorithm 1 to train a network with T ∈ {1, 2, . . . , 8} layers. We set Ct = 500 and η(t) = 0.05 for all t; we set Bt = 1000 for all t if K1max ≤ 400, and set B1 = 1000 and Bt = 500 for t ≥ 2 if K1max > 400.
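The per-heldout-word perplexity used to score the heldout tokens is the standard exponentiated negative average heldout log-likelihood. A minimal sketch, assuming the model has already produced a predictive word distribution per document:

```python
import numpy as np

def per_heldout_word_perplexity(heldout_counts, pred_probs):
    """exp( -(1/N) * sum_{j,v} heldout_counts[j,v] * log pred_probs[j,v] ),
    where N is the total number of heldout word tokens.
    heldout_counts: documents-by-vocabulary heldout token counts
    pred_probs:     per-document predictive word distributions (rows sum to 1)"""
    N = heldout_counts.sum()
    return float(np.exp(-(heldout_counts * np.log(pred_probs)).sum() / N))

# Sanity check: uniform predictions over a 4-term vocabulary give perplexity 4.
counts = np.array([[2, 1, 0, 1],
                   [0, 3, 2, 0]])
probs = np.full((2, 4), 0.25)
print(per_heldout_word_perplexity(counts, probs))  # ~4.0
```

Lower perplexity means the model assigns higher probability to the 70% of tokens held out from each document.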