Blockwise Self-Supervised Learning at Scale

Authors: Shoaib Siddiqui, David Krueger, Yann LeCun, Stéphane Deny

TMLR 2024 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We show that a blockwise pretraining procedure consisting of training independently the 4 main blocks of layers of a ResNet-50 with Barlow Twins loss function at each block performs almost as well as end-to-end backpropagation on ImageNet: a linear probe trained on top of our blockwise pretrained model obtains a top-1 classification accuracy of 70.48%, only 1.1% below the accuracy of an end-to-end pretrained network (71.57% accuracy). We perform extensive experiments to understand the impact of different components within our method and explore a variety of adaptations of self-supervised learning to the blockwise paradigm, building an exhaustive understanding of the critical avenues for scaling local learning rules to large networks, with implications ranging from hardware design to neuroscience. Code to reproduce our experiments is available at: https://github.com/shoaibahmed/blockwise_ssl.
Researcher Affiliation Collaboration Shoaib Ahmed Siddiqui EMAIL University of Cambridge, UK; David Krueger EMAIL University of Cambridge, UK; Yann LeCun EMAIL Meta-FAIR, NY, USA and New York University, NY, USA; Stéphane Deny stephane.deny@aalto.fi Aalto University, Espoo, Finland
Pseudocode Yes We provide the PyTorch pseudocode for blockwise training of models in Appendix E. Algorithm 1: PyTorch pseudocode for our blockwise training scheme.
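The paper's actual pseudocode lives in its Appendix E; as a language-agnostic illustration of the core idea — each block is trained on its own local loss, with a stop-gradient so no signal crosses block boundaries — here is a minimal pure-Python toy. The scalar blocks, the local squared-error objective, and the name `train_blockwise` are all illustrative simplifications, not the paper's implementation (which trains four ResNet-50 blocks with a Barlow Twins loss per block).

```python
# Toy illustration of blockwise training with local losses. Four scalar
# "blocks" y = w * x stand in for the four ResNet-50 stages; each block
# minimizes its own squared error against a target, and the input it
# receives is treated as a constant (mimicking the stop-gradient /
# detach between blocks), so no gradient ever crosses a block boundary.

def train_blockwise(x=1.0, target=2.0, lr=0.05, steps=500):
    ws = [0.5, 0.5, 0.5, 0.5]          # one weight per block
    for _ in range(steps):
        inp = x
        for i in range(len(ws)):
            out = ws[i] * inp
            # local gradient of (out - target)^2 w.r.t. ws[i] only;
            # `inp` is detached, so block i-1 receives no signal
            grad = 2.0 * (out - target) * inp
            ws[i] -= lr * grad
            inp = ws[i] * inp          # detached output feeds block i+1
    return ws
```

After training, every block's output sits near the target even though each update was purely local — the property that lets the paper dispense with end-to-end backpropagation.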
Open Source Code Yes Code to reproduce our experiments is available at: https://github.com/shoaibahmed/blockwise_ssl.
Open Datasets Yes We use a ResNet-50 network...on ImageNet... Deng et al., 2009... We evaluate the robustness of our best model...using the ImageNet-C benchmark (Hendrycks & Dietterich, 2019)... trained the ResNet-50 model on CIFAR-10 from scratch.
Dataset Splits No The paper mentions using the ImageNet, ImageNet-C, and CIFAR-10 datasets and reports evaluation metrics (e.g., top-1 accuracy), but it does not explicitly state the train/validation/test split percentages or sample counts in the main text. It implies the use of standard splits without giving details.
Hardware Specification No The paper does not provide specific hardware details such as GPU models, CPU models, or cloud computing specifications used for running the experiments.
Software Dependencies No We use a ResNet-50 network and adapt the Barlow Twins (Zbontar et al., 2021) codebase to a blockwise training paradigm... We implemented the SimCLR loss function (Chen et al., 2020a) within the Barlow Twins codebase and directly adapted the official VICReg implementation for our experiments with VICReg (Bardes et al., 2022). Algorithm 1: PyTorch pseudocode for our blockwise training scheme. The paper mentions various software components like PyTorch and different codebases (Barlow Twins, SimCLR, VICReg) but does not provide specific version numbers for any of them.
Experiment Setup Yes All our results are for a ResNet-50 trained for 300 epochs, using the LARS optimizer with a batch size of 2048. We use a cosine learning rate decay with an initial learning rate of 0.2. The projector is a three-layered MLP with 8192 hidden units and 8192 output units. In the supervised training case, we use the same training procedure as for the self-supervised case, except that we restrict the output layer of the projector to have 1000 output units, corresponding to the 1000 classes of ImageNet, and apply a simple cross-entropy loss to the output of the projector for classification. We pick a particular σ and add Gaussian noise with zero mean and the chosen standard deviation to the activations of the model before the beginning of every block. Injecting independent Gaussian noise to each of the feature values with µ = 0 and σ = 0.25 gave a boost in performance of 0.5%.
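The noise-injection step quoted above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: the function name `inject_noise` is hypothetical, and plain lists of floats stand in for the PyTorch activation tensors the paper actually perturbs before each block.

```python
import random

# Sketch of the noise-injection trick: independent zero-mean Gaussian
# noise (sigma = 0.25 in the paper's best-performing setting) is added
# to the activations entering each block. A plain Python list stands in
# for a feature tensor; the function name is illustrative.

def inject_noise(activations, sigma=0.25, rng=random):
    """Return activations with i.i.d. N(0, sigma^2) noise added."""
    return [a + rng.gauss(0.0, sigma) for a in activations]
```

Called once on the input of each of the four blocks during pretraining, this perturbation is what the paper credits with the 0.5% accuracy boost.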