Notice: The reproducibility variables underlying each score are classified by an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Dropout: A Simple Way to Prevent Neural Networks from Overfitting

Authors: Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov

JMLR 2014

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets. In Section 6, we present our experimental results where we apply dropout to problems in different domains and compare it with other forms of regularization and model combination.
Researcher Affiliation | Academia | Nitish Srivastava EMAIL Geoffrey Hinton EMAIL Alex Krizhevsky EMAIL Ilya Sutskever EMAIL Ruslan Salakhutdinov EMAIL Department of Computer Science, University of Toronto, 10 King's College Road, Rm 3302, Toronto, Ontario, M5S 3G4, Canada.
Pseudocode | No | The paper describes the model (Section 4) and training procedures (Section 5) using mathematical equations and descriptive text, but it does not include explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format.
Open Source Code | Yes | The code for reproducing these results can be obtained from http://www.cs.toronto.edu/~nitish/dropout. The implementation is GPU-based.
Open Datasets | Yes | We trained dropout neural networks for classification problems on data sets in different domains. We found that dropout improved generalization performance on all data sets compared to neural networks that did not use dropout. Table 1 gives a brief description of the data sets. The data sets are: MNIST: a standard toy data set of handwritten digits. TIMIT: a standard speech benchmark for clean speech recognition. CIFAR-10 and CIFAR-100: tiny natural images (Krizhevsky, 2009). Street View House Numbers (SVHN): images of house numbers collected by Google Street View (Netzer et al., 2011). ImageNet: a large collection of natural images. Reuters-RCV1: a collection of Reuters newswire articles. Alternative Splicing data set: RNA features for predicting alternative gene splicing (Xiong et al., 2011).
Dataset Splits | Yes | The MNIST data set consists of 60,000 training and 10,000 test examples... We held out 10,000 random training images for validation. The SVHN data set... A validation set was constructed by taking examples from both parts: two-thirds from the standard set (400 per class) and one-third from the extra set (200 per class), a total of 6,000 samples. The CIFAR-10 and CIFAR-100 data sets consist of 50,000 training and 10,000 test images each... We used 5,000 of the training images for validation. The Reuters RCV1 corpus... The data was split into equal-sized training and test sets. The alternative splicing data set... Results were averaged across the same 5 folds used by Xiong et al. (2011).
Hardware Specification | No | The implementation is GPU-based. We used the excellent CUDA libraries cudamat (Mnih, 2009) and cuda-convnet (Krizhevsky et al., 2012) to implement our networks. This states only that GPUs and CUDA libraries were used; it does not specify particular GPU models or any other hardware details for the experiments conducted in the paper.
Software Dependencies | No | The implementation is GPU-based. We used the excellent CUDA libraries cudamat (Mnih, 2009) and cuda-convnet (Krizhevsky et al., 2012) to implement our networks. The open source Kaldi toolkit (Povey et al., 2011) was used to preprocess the data into log filter banks. The text names specific software (cudamat, cuda-convnet, the Kaldi toolkit) but gives no version numbers for any of them.
Experiment Setup | Yes | A dropout net should typically use 10-100 times the learning rate... momentum values of 0.95 to 0.99 work quite a lot better. Typical values of c (for max-norm regularization) range from 3 to 4. Typical values of p for hidden units are in the range 0.5 to 0.8. For input layers... a typical value is 0.8. For MNIST: all dropout nets use p = 0.5 for hidden units and p = 0.8 for input units; a final momentum of 0.95 and weight constraints with c = 2 were used in all the layers. For SVHN: dropout was applied to all the layers of the network, with the probability of retaining a unit p = (0.9, 0.75, 0.75, 0.5, 0.5, 0.5)... max-norm constraint with c = 4... momentum of 0.95. For TIMIT: probability of retention p = 0.8 in the input layers and 0.5 in the hidden layers; max-norm constraint with c = 4... momentum of 0.95 with a high learning rate of 0.1.
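The setup row above lists the paper's key training knobs: retention probabilities p, the max-norm radius c, and momentum. As a minimal sketch of how those fit together, the following NumPy fragment implements the train/test dropout transform (non-inverted, with test-time scaling by p as in the paper) and the per-unit max-norm weight projection. The function names and the NumPy framing are our own illustration, not taken from the paper's released code.

```python
import numpy as np

def dropout(x, p_retain, train=True, rng=None):
    # Train: keep each unit independently with probability p_retain
    # (e.g. p = 0.5 for hidden units, p = 0.8 for inputs on MNIST).
    # Test: scale activations by p_retain instead of sampling a mask,
    # the paper's weight-scaling rule for approximate model averaging.
    if train:
        rng = rng if rng is not None else np.random.default_rng()
        return x * (rng.random(x.shape) < p_retain)
    return x * p_retain

def max_norm_project(W, c):
    # After each gradient update, rescale any incoming weight vector
    # (one column per hidden unit) whose L2 norm exceeds c back onto
    # the ball of radius c (typical c is 3-4; c = 2 for MNIST here).
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    return W * np.minimum(1.0, c / np.maximum(norms, 1e-12))
```

In a training loop, `dropout` would be applied to each layer's activations on the forward pass and `max_norm_project` to each weight matrix after the parameter update; at evaluation time the same layers run with `train=False`.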