Multimodal Learning with Deep Boltzmann Machines
Authors: Nitish Srivastava, Ruslan Salakhutdinov
JMLR 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on bi-modal image-text and audio-video data. The fused representation achieves good classification results on the MIR-Flickr data set matching or outperforming other deep models as well as SVM based models that use Multiple Kernel Learning. We further demonstrate that this multimodal model helps classification and retrieval even when only unimodal data is available at test time. |
| Researcher Affiliation | Academia | Nitish Srivastava EMAIL, Department of Computer Science, University of Toronto, 10 King's College Road, Rm 3302, Toronto, Ontario, M5S 3G4, Canada. Ruslan Salakhutdinov EMAIL, Department of Statistics and Computer Science, University of Toronto, 10 King's College Road, Rm 3302, Toronto, Ontario, M5S 3G4, Canada. |
| Pseudocode | Yes | Algorithm 1 Learning Procedure for a Multimodal Deep Boltzmann Machine. |
| Open Source Code | No | The extracted features are publicly available at http://www.cs.toronto.edu/~nitish/multimodal. (Footnote 5 in Section 6.1). This link provides extracted features, not the source code for the methodology described in the paper. |
| Open Datasets | Yes | We used the MIR Flickr Data set (Huiskes and Lew, 2008) in our experiments. ... We combined several data sets in this experiment. CUAVE (Patterson et al., 2002): ... AVLetters (Matthews et al., 2002): ... AVLetters 2 (Cox et al., 2008): ... TIMIT (Fisher et al., 1986): |
| Dataset Splits | Yes | From the 25,000 annotated images we use 10,000 images for training, 5,000 for validation and 10,000 for testing, following Huiskes et al. (2010). |
| Hardware Specification | No | The paper mentions a "fast GPU implementation" but does not specify any particular GPU model or other hardware components used for the experiments. |
| Software Dependencies | No | We used publicly available code (Vedaldi and Fulkerson, 2008; Bastan et al., 2010) for extracting these features. This refers to third-party tools used for feature extraction, but does not provide specific version numbers for these tools or for the authors' own implementation dependencies. |
| Experiment Setup | Yes | The image pathway consists of a Gaussian RBM with 3857 linear visible units and 1024 hidden units. ... The text pathway consists of a Replicated Softmax Model with 2000 visible units and 1024 hidden units... The joint layer contains 2048 hidden units. All hidden units are binary. Each Gaussian visible unit was set to have unit variance (σi = 1) which was kept fixed and not learned. Each layer of weights was pretrained using CD-n where n was gradually increased from 1 to 20. All word count vectors were normalized so that they sum to one. ... we retained each unit with probability p = 0.8. ... we typically used 5 mean-field updates. |
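The setup row above fully specifies the layer sizes of the bimodal DBM, so the inference step it describes can be illustrated concretely. The following is a minimal NumPy sketch, not the authors' implementation: it wires up layers with the quoted dimensions (3857-unit Gaussian image input, 2000-unit Replicated Softmax text input, two 1024-unit pathway layers, a 2048-unit joint layer) and runs the 5 mean-field updates the paper reports, using randomly initialised weights and omitting biases and the second hidden layer of each pathway for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Layer sizes quoted from the paper's experiment setup.
n_img_vis, n_img_hid = 3857, 1024   # Gaussian RBM image pathway
n_txt_vis, n_txt_hid = 2000, 1024   # Replicated Softmax text pathway
n_joint = 2048                      # joint binary layer

# Illustrative random weights; the paper pretrains each layer with CD-n.
W_img = rng.normal(0, 0.01, (n_img_vis, n_img_hid))
W_txt = rng.normal(0, 0.01, (n_txt_vis, n_txt_hid))
W_img_joint = rng.normal(0, 0.01, (n_img_hid, n_joint))
W_txt_joint = rng.normal(0, 0.01, (n_txt_hid, n_joint))

def mean_field_posterior(v_img, v_txt, n_updates=5):
    """Approximate the posterior over hidden layers with mean-field
    updates (the paper typically uses 5)."""
    # Bottom-up initialisation of the mean-field parameters.
    h_img = sigmoid(v_img @ W_img)
    h_txt = sigmoid(v_txt @ W_txt)
    h_joint = sigmoid(h_img @ W_img_joint + h_txt @ W_txt_joint)
    for _ in range(n_updates):
        # Each pathway layer combines bottom-up and top-down input.
        h_img = sigmoid(v_img @ W_img + h_joint @ W_img_joint.T)
        h_txt = sigmoid(v_txt @ W_txt + h_joint @ W_txt_joint.T)
        h_joint = sigmoid(h_img @ W_img_joint + h_txt @ W_txt_joint)
    return h_joint

# One image-text pair: unit-variance Gaussian image features and a
# word-count vector normalised to sum to one, as in the paper.
v_img = rng.normal(size=(1, n_img_vis))
counts = rng.random((1, n_txt_vis))
v_txt = counts / counts.sum()
rep = mean_field_posterior(v_img, v_txt)
print(rep.shape)  # (1, 2048)
```

The 2048-dimensional joint activation is the fused representation the paper evaluates for classification and retrieval; dropout (p = 0.8 retention) and the CD-1-to-CD-20 pretraining schedule quoted above belong to training and are not part of this inference sketch.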