Deep Learning in Target Space

Authors: Michael Fairbank, Spyridon Samothrakis, Luca Citi

JMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this section we show the performance of the target-space method on the Two-Spirals benchmark problem, and on four classic small-image vision benchmark problems for convolutional neural networks, and then we demonstrate the target-space method on some bit-stream manipulation tasks and a sentiment-analysis task for recurrent neural networks. The experiments show the effectiveness of the target-space method, in ability to train deep networks and produce improved generalisation."
Researcher Affiliation | Academia | "Michael Fairbank (EMAIL), Spyridon Samothrakis (EMAIL), Luca Citi (EMAIL); Department of Computer Science and Electronic Engineering, University of Essex, Colchester, CO4 3SQ, UK"
Pseudocode | Yes | "Algorithm 1: Feed-Forward Dynamics...; Algorithm 2: Converting Targets to Weights, in a FFNN, with Sequential Cascade Untangling (SCU)...; Algorithm 3: Calculation of Learning Gradient in Target Space...; Algorithm 4: Recurrent NN Dynamics...; Algorithm 5: Conversion of Targets to Weights for a RNN (using SCU)"
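The targets-to-weights conversion named in Algorithms 2 and 5 amounts, at its core, to a layer-wise regularised least-squares solve: given a layer's input activations X and its target matrix T, find the weights W that best realise those targets. The sketch below (numpy, not the paper's code; function names and the exact regularised form are assumptions, with `lam` playing the role of the λ regulariser in the paper's equation (7)) illustrates that idea, including the sequential cascade in which each layer's realised output becomes the next layer's input:

```python
import numpy as np

def targets_to_weights(X, T, lam=0.001):
    """Ridge-regularised least squares: W = argmin ||X W - T||^2 + lam ||W||^2.
    A sketch of the targets-to-weights step; the paper's Algorithm 2 applies
    this layer by layer with Sequential Cascade Untangling (SCU)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ T)

def cascade_targets_to_weights(X, targets, lam=0.001, act=np.tanh):
    """Sequential cascade: convert a list of per-layer target matrices into
    weights, feeding each layer's realised activations forward as the next
    layer's input matrix."""
    weights, h = [], X
    for T in targets:
        W = targets_to_weights(h, T, lam)
        weights.append(W)
        h = act(h @ W)  # realised output of this layer, input to the next
    return weights, h
```

With a vanishingly small `lam` and exactly realisable targets, the solve recovers the generating weights; with realistic targets it returns the best regularised fit.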
Open Source Code | Yes | "Source code for experiments is available at https://github.com/mikefairbank/dlts_paper_code"
Open Datasets | Yes | "The MNIST digit dataset: 60,000 training samples of 28-by-28 grey-scale pixellated hand-written numeric digits, each labelled from 0-9, and a test set of 10,000 samples (Le Cun et al., 2010). MNIST-Fashion dataset: 60,000 28x28 grayscale images of 10 labelled fashion categories, along with a test set of 10,000 images (Xiao et al., 2017). CIFAR10 dataset: 50,000 32x32 colour training images, labelled over 10 categories, and 10,000 test images (Krizhevsky et al., 2009). CIFAR100 dataset: 50,000 32x32 colour training images, labelled over 100 categories, and 10,000 test images (Krizhevsky et al., 2009). RNN Movie-Review Sentiment Analysis: In this final experiment we trained a RNN to solve the natural-language processing task of sentiment analysis for 50,000 movie reviews from the Internet Movie Database (IMDB) website."
Dataset Splits | Yes | "The MNIST and MNIST-Fashion datasets each provide 60,000 training samples and a 10,000-sample test set (Le Cun et al., 2010; Xiao et al., 2017). The CIFAR10 and CIFAR100 datasets each provide 50,000 training images and 10,000 test images (Krizhevsky et al., 2009). The IMDB dataset was obtained from the Tensorflow/Keras packages, with a 50-50 training/test-set split, using options of only including the top 5000 most frequent words, and padding/truncating all reviews to a length of 500 words each."
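The IMDB preprocessing quoted above (top-5000 vocabulary, reviews padded/truncated to 500 words) can be sketched without the Keras loader. `pad_and_truncate` below is a hypothetical helper, not code from the paper; it mirrors the default Keras `pad_sequences` behaviour (pre-padding with zeros, pre-truncation keeping the tail of long reviews) under the assumption that out-of-vocabulary words map to index 2, as in the Keras IMDB loader:

```python
import numpy as np

def pad_and_truncate(reviews, maxlen=500, num_words=5000, oov=2, pad=0):
    """Keep only word indices below `num_words` (others -> `oov`), then
    pre-pad with `pad` or pre-truncate each review to exactly `maxlen`."""
    out = np.full((len(reviews), maxlen), pad, dtype=np.int64)
    for i, review in enumerate(reviews):
        review = [w if w < num_words else oov for w in review]
        review = review[-maxlen:]             # truncate, keeping the tail
        out[i, maxlen - len(review):] = review  # pre-pad with zeros
    return out
```

For example, with `maxlen=5` a three-word review becomes `[0, 0, w1, w2, w3]`, and a 599-word review keeps only its final five indices.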
Hardware Specification | Yes | "All experiments were implemented using Python and Tensorflow v1.14 on a Tesla K80 GPU."
Software Dependencies | Yes | "All experiments were implemented using Python and Tensorflow v1.14 on a Tesla K80 GPU."
Experiment Setup | Yes |
Two-Spirals: "The Two-Spirals classification problem... used gradient-descent with optimal learning rates empirically determined as η = 10 for target space and η = 0.1 for weight space. The learning rate used was 0.01, which was found to be beneficial to both target space and weight space on this problem... With target space, λ = 0.001 was used for equation (7), and initial targets were randomised using a truncated normal distribution with σ = 1... For weight-space learning, the weights were randomised using the method of Glorot and Bengio (2010)."
CNN experiments: "All non-final layers used the leaky-relu activation function... trained with the cross-entropy loss function and the Adam optimizer, with learning rate 0.001 for weight-space learning, and 0.01 for target-space learning. Minibatches of size nb = 100 were randomly generated at each iteration... A fixed mini-batch of size nb = 100 was used for the targets input matrix X. In weight space, the weight initialisation used magnitudes defined by He et al. (2015)... In target space, the target values were all initially randomised with a truncated normal distribution with standard deviation 0.1... λ = 0.1 was used in equation (7). When dropout was used, it was applied with a dropout probability of 0.2 to all non-final dense layers, and all even-numbered convolutional layers. When batch normalisation was used, it was applied to every convolutional layer and to every non-final dense layer."
Bit-stream RNN experiments: "The neural network has architecture 1-(N+3)-2, with the hidden layer being fully connected to itself with recurrent connections... The hidden layer used tanh activation functions, and the final layer used softmax with cross-entropy loss function... A batch size of 8,000 random bit streams of length nt = N + 50 was used to train the network. Random mini-batches of size nb = 100 were used during each training iteration. A fixed mini-batch of size nb = 100 with nt = nt was used for the target-space matrices X(t). In weight space, the weight initialisation used magnitudes defined by Glorot and Bengio (2010). In target space, the target values were randomised with a truncated normal distribution with standard deviation 1... The networks were trained with 50,000 iterations of the Adam optimiser, with learning rate 0.001 for both weight space and target space, and with λ = 0.1 for target space."
RNN movie-review sentiment analysis: "All neural networks were trained using Adam with learning rate 0.001, and mini-batch sizes of nb = 40. The target-space algorithm used λ = 0.001. Weights and targets were initially randomised as in the previous subsection. Word embeddings were also initially randomised (using a normal distribution with µ = 0 and σ = 0.1). ...a fixed sequence of target-space input matrices X(t) was chosen, for a sequence length of just nt = 60, and mini-batch size nb = 40."
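Several of the setups above randomise initial targets with a truncated normal distribution (σ = 1 for the Two-Spirals and RNN tasks, σ = 0.1 for the CNN tasks). TensorFlow's `truncated_normal` initialiser discards draws beyond two standard deviations; a minimal rejection-sampling sketch of that behaviour (an illustrative assumption, not the paper's code) is:

```python
import numpy as np

def truncated_normal(shape, stddev=0.1, rng=None):
    """Sample a normal distribution truncated at two standard deviations,
    mimicking TensorFlow's truncated_normal initialiser: out-of-range draws
    are rejected and redrawn until the array is filled."""
    rng = np.random.default_rng() if rng is None else rng
    out = np.empty(shape)
    flat = out.reshape(-1)  # view onto out, filled in place
    filled, n = 0, flat.size
    while filled < n:
        draw = rng.standard_normal(n - filled) * stddev
        keep = draw[np.abs(draw) <= 2 * stddev]  # reject beyond 2 sigma
        flat[filled:filled + keep.size] = keep
        filled += keep.size
    return out
```

Every value in the returned array is therefore bounded by 2σ in magnitude, which keeps the initial targets within the responsive range of tanh-style activations.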