Representation Alignment in Neural Networks

Authors: Ehsan Imani, Wei Hu, Martha White

TMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper we show that, after training, neural network representations align their top singular vectors to the targets. We investigate this representation alignment phenomenon in a variety of neural network architectures and find that (a) alignment emerges across a variety of architectures and optimizers, with more alignment arising from depth; (b) alignment increases for layers closer to the output; and (c) existing high-performance deep CNNs exhibit high levels of alignment. We then highlight why alignment between the top singular vectors and the targets can speed up learning, and show in a classic synthetic transfer problem that representation alignment correlates with positive and negative transfer to similar and dissimilar tasks. A demo is available at https://github.com/EhsanEI/rep-align-demo.
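The alignment claim above can be made concrete with a small sketch: measure how much of the target vector's energy lies in the span of the top-k left singular vectors of the representation matrix. This is a hedged illustration of the general idea, not the paper's exact metric; the function name `alignment` and the choice of squared-norm ratio are assumptions.

```python
import numpy as np

def alignment(phi, y, k):
    """Fraction of the target's squared norm captured by the top-k
    left singular vectors of the representation matrix phi
    (rows = examples, columns = hidden units). A hypothetical sketch."""
    u, s, vt = np.linalg.svd(phi, full_matrices=False)
    proj = u[:, :k] @ (u[:, :k].T @ y)  # project y onto top-k singular directions
    return np.linalg.norm(proj) ** 2 / np.linalg.norm(y) ** 2

rng = np.random.default_rng(0)
phi = rng.standard_normal((100, 16))     # toy "representation" matrix
y = phi @ rng.standard_normal(16)        # targets lying in phi's column space
full = alignment(phi, y, k=16)           # all 16 directions capture y exactly
partial = alignment(phi, y, k=4)         # only part of y's energy in the top 4
```

High `partial` at small k would indicate that the representation's dominant singular directions point toward the targets, which is the sense in which the paper's alignment speeds up learning.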
Researcher Affiliation | Academia | Ehsan Imani (University of Alberta), Wei Hu (University of Michigan), Martha White (University of Alberta; CIFAR AI Chair)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It presents propositions and mathematical formulas but no step-by-step, code-like procedures.
Open Source Code | Yes | A demo is available at https://github.com/EhsanEI/rep-align-demo.
Open Datasets | Yes | Availability of large datasets like ImageNet (Russakovsky et al., 2015) and the News Dataset for Word2Vec (Mikolov et al., 2013) provides suitable source tasks that facilitate using neural networks for feature construction for Computer Vision and Natural Language Processing (NLP) tasks (Kornblith et al., 2019; Oquab et al., 2014; Devlin et al., 2018; Pennington et al., 2014). We sampled 1000 points from the UCI CT Position Dataset (Graf et al., 2011). In Figure 2 (a) we sampled 10000 points from the first two classes of the MNIST dataset (5000 points from each class). We take a CNN first trained on the large ImageNet dataset (Russakovsky et al., 2015), and use features given by the last hidden layer to train a linear model on Cifar10 and Cifar100 (Krizhevsky et al., 2009). The dataset is Office-31 (Saenko et al., 2010), which has three domains: Amazon, Webcam, and DSLR.
Dataset Splits | Yes | We sampled 1000 points from the UCI CT Position Dataset (Graf et al., 2011) and created three sets of 1024-dimensional features. In Figure 2 (a) we sampled 10000 points from the first two classes of the MNIST dataset (5000 points from each class). For MNIST and CT Position, we pick 10000 random points and train a neural network with three hidden layers of width 128 using Adam with batch-size 64 until convergence. We create 10000 inputs by sampling the variables A-F randomly from [0, 1) and compute the outputs for the source and target tasks to create a dataset of size 10000 for these tasks. This time we choose 100 out of the 10000 points for the related and unrelated task, and compare the performance of a linear model on the original features, the hidden representation at initialization, and the hidden representation after training on the source task. The test error is evaluated on a new dataset of size 1000 to measure the model's generalization. We first split ImageNet into two source tasks: ImageNet-Artificial with 551 classes and ImageNet-Natural with 449 classes. The target tasks are all 6 binary classification tasks between artificial classes of Cifar10 and all 15 binary classification tasks between natural classes of Cifar10. We split Cifar10 into two target tasks: Cifar10-Natural with six classes and Cifar10-Artificial with four classes.
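The synthetic-data protocol quoted above (six variables A-F drawn uniformly from [0, 1), a training set of 10000 points, and a fresh test set of 1000 points) can be sketched as follows. The target function used here is a hypothetical placeholder, since the paper's actual source/target task definitions are not reproduced in this excerpt.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_split(n):
    """Sample n inputs with six variables A-F uniform on [0, 1).
    The target is a placeholder (sum of inputs), not the paper's task."""
    x = rng.uniform(0.0, 1.0, size=(n, 6))  # columns correspond to A..F
    y = x.sum(axis=1)                       # hypothetical stand-in target
    return x, y

x_train, y_train = make_split(10000)        # training set of size 10000
x_test, y_test = make_split(1000)           # fresh test set of size 1000
```

Evaluating on a freshly sampled test set, rather than a held-out slice of the training pool, matches the quoted setup of generating a new dataset of size 1000 for generalization.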
Hardware Specification | Yes | Batch-Size: 1024 (distributed over 4 V100 GPUs) for pre-training and 256 (on 1 K80 GPU) for fine-tuning; [...] Batch-Size: 256 (distributed over 2 V100 GPUs) for pre-training and 64 (on 1 V100 GPU) for fine-tuning; [...] Batch-Size: 256 (on 1 V100 GPU) for pre-training and 256 (on 1 V100 GPU) for fine-tuning;
Software Dependencies | No | The paper mentions specific optimizers like Adam and SGD, and models like ResNet and T2T-ViT, but it does not specify version numbers for any libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages (e.g., Python 3.x).
Experiment Setup | Yes | We trained a neural network with three hidden layers as wide as the input layer using Adam (Kingma & Ba, 2014) for 1000 epochs with batches of size 64 on the shuffled labels. We swept over four different step-size values in {0.01, 0.1, 1, 10} and chose the best performing model. In the first setting, we fix the hidden layer width to 128, the optimizer to Adam, and train networks of different depth. In the second, we set the depth to 4, the optimizer to Adam, and train networks of different hidden layer width. In the third, we set the depth to 4, the hidden layer width to 128, and train networks with different optimizers. The fourth and fifth settings consider other activations (tanh, PReLU, LeakyReLU, and linear) and batch-sizes (32, 128, and 256). Optimizer: SGD; Momentum: 0.9; Batch-Size: 1024 (distributed over 4 V100 GPUs) for pre-training and 256 (on 1 K80 GPU) for fine-tuning; Initial Learning Rate: 0.1 for pre-training and 0.01 for fine-tuning; Learning Rate Decay Schedule: multiplied by 0.1 after every 30 epochs; Total Number of Epochs: 40 for pre-training and 10 for fine-tuning; Weight Decay: 1e-4; Image Transformations: resize to 256x256 with random cropping and horizontal flipping augmentation.
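The step-size sweep described above (train one model per candidate step-size, keep the best) can be sketched as a simple loop. `train_and_evaluate` here is a stand-in for the paper's actual training loop (three-hidden-layer MLP, Adam, batch 64); to keep the sketch self-contained it fits a linear model with mini-batch SGD on toy data. The paper sweeps {0.01, 0.1, 1, 10}; step-size 10 diverges on this toy problem, so it is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((256, 8))
y = x @ rng.standard_normal(8) + 0.1 * rng.standard_normal(256)

def train_and_evaluate(step_size, epochs=100, batch_size=64):
    """Stand-in training loop: mini-batch SGD on a linear least-squares
    model, returning training MSE. Not the paper's MLP/Adam setup."""
    w = np.zeros(8)
    for _ in range(epochs):
        idx = rng.permutation(len(x))
        for start in range(0, len(x), batch_size):
            b = idx[start:start + batch_size]
            grad = x[b].T @ (x[b] @ w - y[b]) / len(b)
            w -= step_size * grad           # plain SGD stand-in for Adam
    return np.mean((x @ w - y) ** 2)

# Sweep over candidate step-sizes and keep the best-performing model.
errors = {eta: train_and_evaluate(eta) for eta in (0.01, 0.1, 1.0)}
best_step = min(errors, key=errors.get)
```

In practice the selection would be made on a validation set rather than training error; the loop structure of the sweep is the same either way.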