Neural Autoregressive Distribution Estimation
Authors: Benigno Uria, Marc-Alexandre Côté, Karol Gregor, Iain Murray, Hugo Larochelle
JMLR 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We discuss how they achieve competitive performance in modeling both binary and real-valued observations. We also present how deep NADE models can be trained to be agnostic to the ordering of input dimensions used by the autoregressive product rule decomposition. Finally, we also show how to exploit the topological structure of pixels in images using a deep convolutional architecture for NADE. |
| Researcher Affiliation | Collaboration | Benigno Uria EMAIL Google DeepMind, London, UK; Marc-Alexandre Côté EMAIL Department of Computer Science, Université de Sherbrooke, Sherbrooke, J1K 2R1, QC, Canada; Karol Gregor EMAIL Google DeepMind, London, UK; Iain Murray EMAIL School of Informatics, University of Edinburgh, Edinburgh EH8 9AB, UK; Hugo Larochelle EMAIL Twitter, 141 Portland St, Floor 6, Cambridge, MA 02139, USA |
| Pseudocode | Yes | Algorithm 1: Computation of p(x) and learning gradients for NADE. Input: training observation vector x and ordering o of the input dimensions. Output: p(x) and gradients of −log p(x) on parameters. # Computing p(x): a_1 ← c; p(x) ← 1; for d from 1 to D do { h_d ← sigm(a_d); p(x_{o_d}=1 \| x_{o<d}) ← sigm(V_{o_d,·} h_d + b_{o_d}); p(x) ← p(x) · p(x_{o_d}=1 \| x_{o<d})^{x_{o_d}} · (1 − p(x_{o_d}=1 \| x_{o<d}))^{1−x_{o_d}}; a_{d+1} ← a_d + W_{·,o_d} x_{o_d} }. # Computing gradients of −log p(x): δa_D ← 0; δc ← 0; for d from D to 1 do { δb_{o_d} ← p(x_{o_d}=1 \| x_{o<d}) − x_{o_d}; δV_{o_d,·} ← (p(x_{o_d}=1 \| x_{o<d}) − x_{o_d}) h_d^⊤; δh_d ← (p(x_{o_d}=1 \| x_{o<d}) − x_{o_d}) V_{o_d,·}; δc ← δc + δh_d ⊙ h_d ⊙ (1 − h_d); δW_{·,o_d} ← δa_d x_{o_d}; δa_{d−1} ← δa_d + δh_d ⊙ h_d ⊙ (1 − h_d) }. Return p(x), δb, δV, δc, δW. |
| Open Source Code | Yes | The code to reproduce the experiments of the paper is available on GitHub: http://github.com/MarcCote/NADE. Our implementation is done using Theano (Team et al., 2016). |
| Open Datasets | Yes | These data sets were mostly taken from the LIBSVM data sets web site (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/), except for OCR-letters (http://ai.stanford.edu/~btaskar/ocr/) and NIPS-0-12 (http://www.cs.nyu.edu/~roweis/data.html). Code to download these data sets is available at http://info.usherbrooke.ca/hlarochelle/code/nade.tar.gz. We start by considering three UCI data sets (Bache and Lichman, 2013), previously used to study the performance of other density estimators (Silva et al., 2011; Tang et al., 2012), namely: red wine, white wine and parkinsons. Following the work of Zoran and Weiss (2011), we use 8-by-8-pixel patches of monochrome natural images, obtained from the BSDS300 data set (Martin et al., 2001; Figure 9 gives examples). We also measured the ability of RNADE to model small patches of speech spectrograms, extracted from the TIMIT data set (Garofolo et al., 1993). |
| Dataset Splits | Yes | For these experiments, we only consider tractable distribution estimators, where we can evaluate p(x) on test items exactly. ... The validation set performance was used to select the learning rate from {0.005, 0.0005, 0.00005}, and the number of iterations over the training set from {100, 500, 1000}. ... Due to the small size of the data sets (see Table 5), we used 10-folds, using 90% of the data for training, and 10% for testing. ... We used the remaining 20 images in the training subset as validation data. We used 1000 random patches from the validation subset to early-stop training of RNADE. We measured the performance of each model by their log-likelihood on one million patches drawn randomly from the test subset of 100 images not present in the training data. ... We fitted the models using the standard TIMIT training subset, which includes recordings from 605 speakers of American English. We compare RNADE with a mixture of Gaussians by measuring their log-likelihood on the complete TIMIT core-test data set: a held-out set of 25 speakers. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments. It mentions Theano, a software framework, but not the underlying hardware. |
| Software Dependencies | No | Our implementation is done using Theano (Team et al., 2016). ... The Adam optimizer (Kingma and Ba, 2015) was used with a learning rate of 10−4. The paper mentions software frameworks and optimizers but does not provide specific version numbers for these or any other libraries. |
| Experiment Setup | Yes | For these experiments, we only consider tractable distribution estimators, where we can evaluate p(x) on test items exactly. ... The sigmoid activation function was used for the hidden layer, of size 500. Much like for FVSBN, training relied on stochastic gradient descent and the validation set was used for early stopping, as well as for choosing the learning rate from {0.05, 0.005, 0.0005}, and the decreasing schedule constant γ from {0, 0.001, 0.000001}. ... The rectified linear activation function was used for the hidden layer, also of size 500. Minibatch gradient descent was used for training, with minibatches of size 100. The initial learning rate, chosen among {0.016, 0.004, 0.001, 0.00025, 0.0000675}, was linearly decayed to zero over the course of 100,000 parameter updates. ... The rectified linear activation function was used for all hidden layers. Minibatch gradient descent was used for training, with minibatches of size 1000. Training consisted of 200 iterations of 1000 parameter updates. Each hidden layer was pre-trained according to Algorithm 2. ... The Adam optimizer (Kingma and Ba, 2015) was used with a learning rate of 10−4. Early stopping was used with a look ahead of 10 epochs, using Equation 34 to get a stochastic estimate of the validation set average log-likelihood. |
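Algorithm 1 quoted in the Pseudocode row can be sketched in NumPy. This is a minimal re-implementation for illustration, not the authors' Theano code: the function names (`nade_logprob_and_grads`, `sigm`) and array shapes are assumptions. It returns log p(x) rather than p(x) for numerical stability, and the δ quantities are gradients of −log p(x), as computed in the algorithm.

```python
# Minimal NumPy sketch of Algorithm 1 for a binary NADE (hypothetical shapes):
#   x: (D,) binary observation, o: (D,) permutation of 0..D-1 (the ordering),
#   W: (H, D), V: (D, H), b: (D,), c: (H,).
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def nade_logprob_and_grads(x, o, W, V, b, c):
    D, H = len(x), len(c)
    a = c.astype(float).copy()      # a_1 <- c
    h = np.zeros((D, H))
    p = np.zeros(D)                 # p(x_{o_d} = 1 | x_{o<d})
    logpx = 0.0
    for d in range(D):              # forward pass: accumulate log p(x)
        od = o[d]
        h[d] = sigm(a)
        p[d] = sigm(V[od] @ h[d] + b[od])
        logpx += x[od] * np.log(p[d]) + (1 - x[od]) * np.log(1.0 - p[d])
        a = a + W[:, od] * x[od]    # a_{d+1} <- a_d + W_{.,o_d} x_{o_d}
    # Backward pass: gradients of -log p(x) on parameters.
    dW, dV = np.zeros_like(W), np.zeros_like(V)
    db, dc = np.zeros_like(b), np.zeros_like(c)
    da = np.zeros(H)                # delta a_D <- 0
    for d in range(D - 1, -1, -1):
        od = o[d]
        err = p[d] - x[od]          # d(-log p)/d(output pre-activation)
        db[od] = err
        dV[od] = err * h[d]
        dh = err * V[od]
        dc += dh * h[d] * (1.0 - h[d])
        dW[:, od] = da * x[od]      # uses delta a_d before the update below
        da = da + dh * h[d] * (1.0 - h[d])
    return logpx, dW, dV, db, dc
```

Both loops run in O(DH), which is the key property of NADE: each hidden state h_d is obtained from the previous one by a rank-one update of the pre-activation a, rather than a fresh matrix product.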
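The Experiment Setup row quotes an initial learning rate "linearly decayed to zero over the course of 100,000 parameter updates". A minimal sketch of that schedule, assuming the simplest linear interpretation (the function name and step convention are not from the paper):

```python
def linear_decay_lr(initial_lr, step, total_steps=100_000):
    """Learning rate linearly decayed from initial_lr at step 0 to zero
    at total_steps; clamped at zero thereafter."""
    return initial_lr * max(0.0, 1.0 - step / total_steps)
```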