Multimodal Learning with Deep Boltzmann Machines

Authors: Nitish Srivastava, Ruslan Salakhutdinov

JMLR 2014

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on bi-modal image-text and audio-video data. The fused representation achieves good classification results on the MIR-Flickr data set, matching or outperforming other deep models as well as SVM based models that use Multiple Kernel Learning. We further demonstrate that this multimodal model helps classification and retrieval even when only unimodal data is available at test time.
Researcher Affiliation | Academia | Nitish Srivastava, EMAIL, Department of Computer Science, University of Toronto, 10 Kings College Road, Rm 3302, Toronto, Ontario, M5S 3G4, Canada. Ruslan Salakhutdinov, EMAIL, Department of Statistics and Computer Science, University of Toronto, 10 Kings College Road, Rm 3302, Toronto, Ontario, M5S 3G4, Canada.
Pseudocode | Yes | Algorithm 1: Learning Procedure for a Multimodal Deep Boltzmann Machine.
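The learning procedure referenced above begins with greedy layer-wise pretraining of each pathway. The following is an illustrative sketch only, not the paper's implementation: it uses a minimal binary RBM trained with CD-1 (the paper anneals CD-n from n = 1 to 20 and uses Gaussian and Replicated Softmax visible units), and all layer sizes, learning rates, and data below are placeholder assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Minimal binary-binary RBM; a stand-in for the pathway layers."""
    def __init__(self, n_vis, n_hid, rng):
        self.W = 0.01 * rng.standard_normal((n_vis, n_hid))
        self.b_vis = np.zeros(n_vis)
        self.b_hid = np.zeros(n_hid)
        self.rng = rng

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_hid)

    def cd1_update(self, v0, lr=0.05):
        # Positive phase statistics.
        h0 = self.hidden_probs(v0)
        # Negative phase: a single Gibbs step (contrastive divergence, CD-1).
        h0_sample = (self.rng.random(h0.shape) < h0).astype(float)
        v1 = sigmoid(h0_sample @ self.W.T + self.b_vis)
        h1 = self.hidden_probs(v1)
        n = v0.shape[0]
        self.W += lr * (v0.T @ h0 - v1.T @ h1) / n
        self.b_vis += lr * (v0 - v1).mean(axis=0)
        self.b_hid += lr * (h0 - h1).mean(axis=0)

def pretrain_pathway(data, layer_sizes, epochs=5, rng=None):
    """Greedy layer-wise pretraining: train an RBM per layer, then feed its
    hidden probabilities upward as data for the next layer."""
    rng = rng or np.random.default_rng(0)
    rbms, x = [], data
    for n_hid in layer_sizes:
        rbm = RBM(x.shape[1], n_hid, rng)
        for _ in range(epochs):
            rbm.cd1_update(x)
        rbms.append(rbm)
        x = rbm.hidden_probs(x)
    return rbms, x
```

In the full procedure, each modality's pathway is pretrained this way, the pathways are then joined at the shared top layer, and the whole model is fine-tuned jointly (the paper uses mean-field variational inference for the joint phase, which this sketch omits).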
Open Source Code | No | The extracted features are publicly available at http://www.cs.toronto.edu/~nitish/multimodal (Footnote 5 in Section 6.1). This link provides extracted features, not the source code for the methodology described in the paper.
Open Datasets | Yes | We used the MIR Flickr Data set (Huiskes and Lew, 2008) in our experiments. ... We combined several data sets in this experiment. CUAVE (Patterson et al., 2002): ... AVLetters (Matthews et al., 2002): ... AVLetters 2 (Cox et al., 2008): ... TIMIT (Fisher et al., 1986):
Dataset Splits | Yes | From the 25,000 annotated images we use 10,000 images for training, 5,000 for validation and 10,000 for testing, following Huiskes et al. (2010).
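The reported 10,000 / 5,000 / 10,000 split of the 25,000 annotated MIR Flickr images can be sketched as follows. The paper follows the split of Huiskes et al. (2010); the random seed and shuffling here are assumptions for illustration only.

```python
import numpy as np

def split_indices(n_total=25000, n_train=10000, n_valid=5000,
                  n_test=10000, seed=0):
    """Partition image indices into disjoint train/validation/test sets."""
    assert n_train + n_valid + n_test == n_total
    idx = np.random.default_rng(seed).permutation(n_total)
    return (idx[:n_train],
            idx[n_train:n_train + n_valid],
            idx[n_train + n_valid:])

train_idx, valid_idx, test_idx = split_indices()
```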
Hardware Specification | No | The paper mentions a "fast GPU implementation" but does not specify any particular GPU model or other hardware components used for the experiments.
Software Dependencies | No | We used publicly available code (Vedaldi and Fulkerson, 2008; Bastan et al., 2010) for extracting these features. This refers to third-party tools used for feature extraction, but no version numbers are given for those tools or for the authors' own implementation dependencies.
Experiment Setup | Yes | The image pathway consists of a Gaussian RBM with 3857 linear visible units and 1024 hidden units. ... The text pathway consists of a Replicated Softmax Model with 2000 visible units and 1024 hidden units. ... The joint layer contains 2048 hidden units. All hidden units are binary. Each Gaussian visible unit was set to have unit variance (σi = 1), which was kept fixed and not learned. Each layer of weights was pretrained using CD-n, where n was gradually increased from 1 to 20. All word count vectors were normalized so that they sum to one. ... we retained each unit with probability p = 0.8. ... we typically used 5 mean-field updates.
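The architectural constants and preprocessing steps quoted above can be collected into a small sketch. The layer sizes, fixed variance, retention probability, and mean-field step count come from the paper; the helper function names and the inverted-dropout scaling are illustrative assumptions.

```python
import numpy as np

# Constants reported in the paper's experiment setup.
IMAGE_VISIBLE, IMAGE_HIDDEN = 3857, 1024   # Gaussian RBM image pathway
TEXT_VISIBLE, TEXT_HIDDEN = 2000, 1024     # Replicated Softmax text pathway
JOINT_HIDDEN = 2048                        # shared multimodal layer
SIGMA = 1.0                                # Gaussian visible variance, fixed
RETAIN_P = 0.8                             # retain each unit with p = 0.8
MEAN_FIELD_STEPS = 5                       # typical number of updates

def normalize_word_counts(counts):
    """Normalize each word-count vector to sum to one, as described."""
    totals = counts.sum(axis=1, keepdims=True)
    return counts / np.maximum(totals, 1e-12)

def dropout_mask(shape, p=RETAIN_P, rng=None):
    """Retain each unit with probability p. The 1/p rescaling is the
    'inverted dropout' convention, an assumption here; the original dropout
    formulation instead scales weights at test time."""
    rng = rng or np.random.default_rng(0)
    return (rng.random(shape) < p).astype(float) / p
```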