A Theoretical Analysis of Self-Supervised Learning for Vision Transformers
Authors: Yu Huang, Zixin Wen, Yuejie Chi, Yingbin Liang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The main body of the paper presents only theoretical results; full proofs, along with proof sketches that offer intuitive explanations of the proof steps, are provided in the appendices. The appendices also contain experimental results, with detailed descriptions of the experimental settings to facilitate reproduction. |
| Researcher Affiliation | Academia | University of Pennsylvania; Carnegie Mellon University; The Ohio State University |
| Pseudocode | No | The paper focuses on theoretical analysis and proof techniques, describing gradient descent dynamics and attention correlations, but it does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The reproducibility statement mentions "detailed descriptions of the experimental settings to facilitate result reproduction" but does not include any explicit statement of code release, a link to a repository, or mention of code in supplementary materials for the methodology described in the paper. |
| Open Datasets | Yes | Setup. In this work, we compare the performance of the ViT-B/16 encoder pre-trained on ImageNet1K (Russakovsky et al., 2015) among the following four models: a masked reconstruction model (MAE), a contrastive learning model (MoCo v3; Chen et al., 2021b), another self-supervised model (DINO; Caron et al., 2021), and a supervised model (DeiT; Touvron et al., 2021). |
| Dataset Splits | No | The paper mentions using ViT-B/16 encoder pre-trained on ImageNet1K and analyzing attention focus across 152 example images, but it does not specify any training/test/validation splits for these images or how they were selected from the dataset. |
| Hardware Specification | No | The paper, including its experimental section and reproducibility statement, does not provide any specific details about the hardware (e.g., GPU models, CPU types) used for running the experiments. |
| Software Dependencies | No | The paper, including its experimental section and reproducibility statement, does not list any specific software dependencies with version numbers. |
| Experiment Setup | No | The paper describes the models used (MAE, MoCo v3, DINO, DeiT) and the focus of the analysis (12 different attention heads in the last layer of ViT-B on 152 example images), but it lacks specific experimental setup details such as hyperparameters (e.g., learning rates, batch sizes) for their analysis or model training. |
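The analysis described above inspects per-head attention maps in the last layer of a ViT-B/16. Since the paper releases no code, the sketch below illustrates one plausible way such maps could be computed: given the token embeddings entering a transformer block and its fused QKV projection, it recovers each head's softmax attention from the [CLS] token to the patch tokens. The function name, the use of random stand-in weights, and the dimensions are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    # numerically stable softmax
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cls_attention_per_head(x, qkv_weight, num_heads):
    """x: (batch, tokens, dim) embeddings entering a transformer block.
    qkv_weight: (3*dim, dim) fused query/key/value projection (hypothetical).
    Returns (batch, num_heads, tokens-1): attention of the [CLS] token
    (index 0) over the patch tokens, one map per head."""
    b, n, d = x.shape
    head_dim = d // num_heads
    qkv = x @ qkv_weight.T                       # (b, n, 3d)
    q, k = qkv[..., :d], qkv[..., d:2 * d]       # value path not needed here
    # split into heads: (b, heads, n, head_dim)
    q = q.reshape(b, n, num_heads, head_dim).transpose(0, 2, 1, 3)
    k = k.reshape(b, n, num_heads, head_dim).transpose(0, 2, 1, 3)
    attn = softmax(q @ k.transpose(0, 1, 3, 2) / head_dim ** 0.5)
    return attn[:, :, 0, 1:]                     # [CLS] row, patch columns

# toy dimensions matching ViT-B/16: dim=768, 12 heads, 196 patches + [CLS]
x = np.random.randn(1, 197, 768)
w = np.random.randn(3 * 768, 768) * 0.02        # stand-in weights
maps = cls_attention_per_head(x, w, num_heads=12)
print(maps.shape)                               # (1, 12, 196)
```

In a real reproduction the embeddings and QKV weights would come from the pre-trained checkpoints (e.g., via forward hooks on the last block), and the 12 resulting maps per image would then be summarized over the 152 example images.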