Blockwise Self-Supervised Learning at Scale

Authors: Shoaib Siddiqui, David Krueger, Yann LeCun, Stéphane Deny

TMLR 2024 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We show that a blockwise pretraining procedure consisting of training independently the 4 main blocks of layers of a ResNet-50 with Barlow Twins loss function at each block performs almost as well as end-to-end backpropagation on ImageNet: a linear probe trained on top of our blockwise pretrained model obtains a top-1 classification accuracy of 70.48%, only 1.1% below the accuracy of an end-to-end pretrained network (71.57% accuracy). We perform extensive experiments to understand the impact of different components within our method and explore a variety of adaptations of self-supervised learning to the blockwise paradigm, building an exhaustive understanding of the critical avenues for scaling local learning rules to large networks, with implications ranging from hardware design to neuroscience. Code to reproduce our experiments is available at: https://github.com/shoaibahmed/blockwise_ssl.
Researcher Affiliation Collaboration Shoaib Ahmed Siddiqui EMAIL University of Cambridge, UK; David Krueger EMAIL University of Cambridge, UK; Yann LeCun EMAIL Meta-FAIR, NY, USA and New York University, NY, USA; Stéphane Deny stephane.deny@aalto.fi Aalto University, Espoo, Finland
Pseudocode Yes We provide the PyTorch pseudocode for blockwise training of models in Appendix E. Algorithm 1: PyTorch pseudocode for our blockwise training scheme.
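The paper's actual pseudocode lives in its Appendix E; as a language-agnostic illustration of the core idea — each block is trained on its own local loss, with a stop-gradient so no signal crosses block boundaries — here is a minimal pure-Python toy. The scalar blocks, the local squared-error objective, and the name `train_blockwise` are all illustrative simplifications, not the paper's implementation (which trains four ResNet-50 blocks with a Barlow Twins loss per block).

```python
# Toy illustration of blockwise training with local losses. Four scalar
# "blocks" y = w * x stand in for the four ResNet-50 stages; each block
# minimizes its own squared error against a target, and the input it
# receives is treated as a constant (mimicking the stop-gradient /
# detach between blocks), so no gradient ever crosses a block boundary.

def train_blockwise(x=1.0, target=2.0, lr=0.05, steps=500):
    ws = [0.5, 0.5, 0.5, 0.5]          # one weight per block
    for _ in range(steps):
        inp = x
        for i in range(len(ws)):
            out = ws[i] * inp
            # local gradient of (out - target)^2 w.r.t. ws[i] only;
            # `inp` is detached, so block i-1 receives no signal
            grad = 2.0 * (out - target) * inp
            ws[i] -= lr * grad
            inp = ws[i] * inp          # detached output feeds block i+1
    return ws
```

After training, every block's output sits near the target even though each update was purely local — the property that lets the paper dispense with end-to-end backpropagation.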
Open Source Code Yes Code to reproduce our experiments is available at: https://github.com/shoaibahmed/blockwise_ssl.
Open Datasets Yes We use a ResNet-50 network...on ImageNet... Deng et al., 2009... We evaluate the robustness of our best model...using the ImageNet-C benchmark (Hendrycks & Dietterich, 2019)... trained the ResNet-50 model on CIFAR-10 from scratch.
Dataset Splits No The paper mentions using the ImageNet, ImageNet-C, and CIFAR-10 datasets and reports evaluation metrics (e.g., top-1 accuracy), but it does not explicitly state the train/validation/test split percentages or sample counts in the main text. It implies the use of standard splits without giving details.
Hardware Specification No The paper does not provide specific hardware details such as GPU models, CPU models, or cloud computing specifications used for running the experiments.
Software Dependencies No We use a ResNet-50 network and adapt the Barlow Twins (Zbontar et al., 2021) codebase to a blockwise training paradigm... We implemented the SimCLR loss function (Chen et al., 2020a) within the Barlow Twins codebase and directly adapted the official VICReg implementation for our experiments with VICReg (Bardes et al., 2022). Algorithm 1: PyTorch pseudocode for our blockwise training scheme. The paper mentions various software components like PyTorch and different codebases (Barlow Twins, SimCLR, VICReg) but does not provide specific version numbers for any of them.
Experiment Setup Yes All our results are for a ResNet-50 trained for 300 epochs, using the LARS optimizer with a batch size of 2048. We use a cosine learning rate decay with an initial learning rate of 0.2. The projector is a three-layered MLP with 8192 hidden units and 8192 output units. In the supervised training case, we use the same training procedure as for the self-supervised case, except that we restrict the output layer of the projector to have 1000 output units, corresponding to the 1000 classes of ImageNet, and apply a simple cross-entropy loss to the output of the projector for classification. We pick a particular σ and add Gaussian noise with zero mean and the chosen standard deviation to the activations of the model before the beginning of every block. Injecting independent Gaussian noise to each of the feature values with µ = 0 and σ = 0.25 gave a boost in performance of 0.5%.
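The noise-injection step quoted above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: the function name `inject_noise` is hypothetical, and plain lists of floats stand in for the PyTorch activation tensors the paper actually perturbs before each block.

```python
import random

# Sketch of the noise-injection trick: independent zero-mean Gaussian
# noise (sigma = 0.25 in the paper's best-performing setting) is added
# to the activations entering each block. A plain Python list stands in
# for a feature tensor; the function name is illustrative.

def inject_noise(activations, sigma=0.25, rng=random):
    """Return activations with i.i.d. N(0, sigma^2) noise added."""
    return [a + rng.gauss(0.0, sigma) for a in activations]
```

Called once on the input of each of the four blocks during pretraining, this perturbation is what the paper credits with the 0.5% accuracy boost.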