Representation Alignment in Neural Networks

Authors: Ehsan Imani, Wei Hu, Martha White

TMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper we show that, after training, neural network representations align their top singular vectors to the targets. We investigate this representation alignment phenomenon in a variety of neural network architectures and find that (a) alignment emerges across a variety of architectures and optimizers, with more alignment arising from depth; (b) alignment increases for layers closer to the output; and (c) existing high-performance deep CNNs exhibit high levels of alignment. We then highlight why alignment between the top singular vectors and the targets can speed up learning, and show in a classic synthetic transfer problem that representation alignment correlates with positive and negative transfer to similar and dissimilar tasks. A demo is available at https://github.com/EhsanEI/rep-align-demo.
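The alignment claim above can be made concrete with a small sketch: measure how much of the target vector's energy lies in the span of the top-k left singular vectors of the representation matrix. This is a hedged illustration of the general idea, not the paper's exact metric; the function name `alignment` and the choice of squared-norm ratio are assumptions.

```python
import numpy as np

def alignment(phi, y, k):
    """Fraction of the target's squared norm captured by the top-k
    left singular vectors of the representation matrix phi
    (rows = examples, columns = hidden units). A hypothetical sketch."""
    u, s, vt = np.linalg.svd(phi, full_matrices=False)
    proj = u[:, :k] @ (u[:, :k].T @ y)  # project y onto top-k singular directions
    return np.linalg.norm(proj) ** 2 / np.linalg.norm(y) ** 2

rng = np.random.default_rng(0)
phi = rng.standard_normal((100, 16))     # toy "representation" matrix
y = phi @ rng.standard_normal(16)        # targets lying in phi's column space
full = alignment(phi, y, k=16)           # all 16 directions capture y exactly
partial = alignment(phi, y, k=4)         # only part of y's energy in the top 4
```

High `partial` at small k would indicate that the representation's dominant singular directions point toward the targets, which is the sense in which the paper's alignment speeds up learning.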
Researcher Affiliation | Academia | Ehsan Imani (University of Alberta), Wei Hu (University of Michigan), Martha White (University of Alberta; CIFAR AI Chair)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It presents propositions and mathematical formulas but no step-by-step, code-like procedures.
Open Source Code | Yes | A demo is available at https://github.com/EhsanEI/rep-align-demo.
Open Datasets | Yes | Availability of large datasets like ImageNet (Russakovsky et al., 2015) and the News Dataset for Word2Vec (Mikolov et al., 2013) provides suitable source tasks that facilitate using neural networks for feature construction for Computer Vision and Natural Language Processing (NLP) tasks (Kornblith et al., 2019; Oquab et al., 2014; Devlin et al., 2018; Pennington et al., 2014). We sampled 1000 points from the UCI CT Position Dataset (Graf et al., 2011). In Figure 2 (a) we sampled 10000 points from the first two classes of the MNIST dataset (5000 points from each class). We take a CNN first trained on the large ImageNet dataset (Russakovsky et al., 2015), and use features given by the last hidden layer to train a linear model on Cifar10 and Cifar100 (Krizhevsky et al., 2009). The dataset is Office-31 (Saenko et al., 2010), which has three domains: Amazon, Webcam, and DSLR.
Dataset Splits | Yes | We sampled 1000 points from the UCI CT Position Dataset (Graf et al., 2011) and created three sets of 1024-dimensional features. In Figure 2 (a) we sampled 10000 points from the first two classes of the MNIST dataset (5000 points from each class). For MNIST and CT Position, we pick 10000 random points and train a neural network with three hidden layers of width 128 using Adam with batch-size 64 until convergence. We create 10000 inputs by sampling the variables A-F randomly from [0, 1) and compute the outputs for the source and target tasks to create a dataset of size 10000 for these tasks. This time we choose 100 out of the 10000 points for the related and unrelated task, and compare the performance of a linear model on the original features, the hidden representation at initialization, and the hidden representation after training on the source task. The test error is evaluated on a new dataset of size 1000 to measure the model's generalization. We first split ImageNet into two source tasks: ImageNet-Artificial with 551 classes and ImageNet-Natural with 449 classes. The target tasks are all 6 binary classification tasks between artificial classes of Cifar10 and all 15 binary classification tasks between natural classes of Cifar10. We split Cifar10 into two target tasks: Cifar10-Natural with six classes and Cifar10-Artificial with four classes.
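The synthetic-data protocol quoted above (six variables A-F drawn uniformly from [0, 1), a training set of 10000 points, and a fresh test set of 1000 points) can be sketched as follows. The target function used here is a hypothetical placeholder, since the paper's actual source/target task definitions are not reproduced in this excerpt.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_split(n):
    """Sample n inputs with six variables A-F uniform on [0, 1).
    The target is a placeholder (sum of inputs), not the paper's task."""
    x = rng.uniform(0.0, 1.0, size=(n, 6))  # columns correspond to A..F
    y = x.sum(axis=1)                       # hypothetical stand-in target
    return x, y

x_train, y_train = make_split(10000)        # training set of size 10000
x_test, y_test = make_split(1000)           # fresh test set of size 1000
```

Evaluating on a freshly sampled test set, rather than a held-out slice of the training pool, matches the quoted setup of generating a new dataset of size 1000 for generalization.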
Hardware Specification | Yes | Batch-Size: 1024 (distributed over 4 V100 GPUs) for pre-training and 256 (on 1 K80 GPU) for fine-tuning; [...] Batch-Size: 256 (distributed over 2 V100 GPUs) for pre-training and 64 (on 1 V100 GPU) for fine-tuning; [...] Batch-Size: 256 (on 1 V100 GPU) for pre-training and 256 (on 1 V100 GPU) for fine-tuning;
Software Dependencies | No | The paper mentions specific optimizers like Adam and SGD, and models like ResNet and T2T-ViT, but it does not specify version numbers for any libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages (e.g., Python 3.x).
Experiment Setup | Yes | We trained a neural network with three hidden layers as wide as the input layer using Adam (Kingma & Ba, 2014) for 1000 epochs with batches of size 64 on the shuffled labels. We swept over four different step-size values in {0.01, 0.1, 1, 10} and chose the best performing model. In the first setting, we fix the hidden layer width to 128, the optimizer to Adam, and train networks of different depth. In the second, we set the depth to 4, the optimizer to Adam, and train networks of different hidden layer width. In the third, we set the depth to 4, the hidden layer width to 128, and train networks with different optimizers. The fourth and fifth settings consider other activations (tanh, PReLU, LeakyReLU, and linear) and batch-sizes (32, 128, and 256). Optimizer: SGD; Momentum: 0.9; Batch-Size: 1024 (distributed over 4 V100 GPUs) for pre-training and 256 (on 1 K80 GPU) for fine-tuning; Initial Learning Rate: 0.1 for pre-training and 0.01 for fine-tuning; Learning Rate Decay Schedule: multiplied by 0.1 after every 30 epochs; Total Number of Epochs: 40 for pre-training and 10 for fine-tuning; Weight Decay: 1e-4; Image Transformations: resize to 256x256 with random cropping and horizontal flipping augmentation.
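The step-size sweep described above (train one model per candidate step-size, keep the best) can be sketched as a simple loop. `train_and_evaluate` here is a stand-in for the paper's actual training loop (three-hidden-layer MLP, Adam, batch 64); to keep the sketch self-contained it fits a linear model with mini-batch SGD on toy data. The paper sweeps {0.01, 0.1, 1, 10}; step-size 10 diverges on this toy problem, so it is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((256, 8))
y = x @ rng.standard_normal(8) + 0.1 * rng.standard_normal(256)

def train_and_evaluate(step_size, epochs=100, batch_size=64):
    """Stand-in training loop: mini-batch SGD on a linear least-squares
    model, returning training MSE. Not the paper's MLP/Adam setup."""
    w = np.zeros(8)
    for _ in range(epochs):
        idx = rng.permutation(len(x))
        for start in range(0, len(x), batch_size):
            b = idx[start:start + batch_size]
            grad = x[b].T @ (x[b] @ w - y[b]) / len(b)
            w -= step_size * grad           # plain SGD stand-in for Adam
    return np.mean((x @ w - y) ** 2)

# Sweep over candidate step-sizes and keep the best-performing model.
errors = {eta: train_and_evaluate(eta) for eta in (0.01, 0.1, 1.0)}
best_step = min(errors, key=errors.get)
```

In practice the selection would be made on a validation set rather than training error; the loop structure of the sweep is the same either way.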