ARVideo: Autoregressive Pretraining for Self-Supervised Video Representation Learning
Authors: Sucheng Ren, Hongru Zhu, Chen Wei, Yijiang Li, Alan Yuille, Cihang Xie
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments establish ARVideo as an effective paradigm for self-supervised video representation learning. For example, when trained with the ViT-B backbone, ARVideo competitively attains 81.2% on Kinetics-400 and 70.9% on Something-Something V2, which are on par with the strong benchmark set by VideoMAE. |
| Researcher Affiliation | Academia | Sucheng Ren (Johns Hopkins University); Hongru Zhu (Johns Hopkins University); Chen Wei (Johns Hopkins University); Yijiang Li (Johns Hopkins University); Alan Yuille (Johns Hopkins University); Cihang Xie (UC Santa Cruz) |
| Pseudocode | No | The paper describes the methodology in prose and through figures, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | We primarily evaluate ARVideo on Kinetics-400 (Kay et al., 2017) and Something-Something V2 (Goyal et al., 2017). ... Additionally, we assess the feature transferability on AVA v2.2 (Gu et al., 2018) and HMDB (Kuehne et al., 2011). |
| Dataset Splits | Yes | Specifically, Kinetics-400 contains 400 classes and 260k videos of 10s, with 240k for training and 20k for validation; Something-Something V2 contains 174 classes with 169k videos for training and 25k for validation. ... AVA v2.2 is a human action localization dataset with 211k videos for training and 57k for validation; HMDB is a small video dataset with 3.5k videos for training and 1.5k videos for validation. |
| Hardware Specification | Yes | We report the training time and GPU memory usage in Table 5 (with ViT-B trained on Kinetics-400 for 800 epochs, using 8 A6000 GPUs). |
| Software Dependencies | No | The paper mentions using the AdamW optimizer and ViT-B as the backbone, but does not provide specific version numbers for software libraries like PyTorch, CUDA, or other dependencies. |
| Experiment Setup | Yes | Training Hyperparameters: We employ the AdamW optimizer with a weight decay of 0.05 and a base learning rate of 6e-4. The training schedule comprises a 40-epoch warmup phase followed by a cosine decay learning rate schedule. Finetuning Hyperparameters: ... we employ the AdamW optimizer with a base learning rate of 5e-4 and a weight decay of 0.05. The batch size is set to 512, and we utilize a cosine decay learning rate schedule with 5 warmup epochs over a total of 40 training epochs. Our data augmentation strategies include repeated augmentation (factor of 2) and RandAugment with parameters (9, 0.5), while flip augmentation is disabled. Additionally, we apply label smoothing (0.1), mixup (0.8), and cutmix (1.0). The drop path rate is configured at 0.1, and no dropout is applied for Something-Something V2. The layer-wise learning rate decay factor is set to 0.75. For the Kinetics-400 dataset, most settings remain unchanged except for the following adjustments: the base learning rate is increased to 1e-3, flip augmentation is enabled, and the total number of training epochs is extended to 75. |
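The finetuning schedule quoted above (base learning rate 5e-4, 5 warmup epochs, cosine decay over 40 total epochs) can be sketched in plain Python. This is a minimal reconstruction, not the paper's code: the linear-warmup shape, the zero final learning rate, and the function name `lr_at_epoch` are assumptions.

```python
import math

# Hyperparameters as quoted from the paper's Something-Something V2 finetuning setup.
BASE_LR = 5e-4
WARMUP_EPOCHS = 5
TOTAL_EPOCHS = 40


def lr_at_epoch(epoch: int) -> float:
    """Learning rate at the start of a given 0-indexed epoch.

    Assumed shape: linear warmup to BASE_LR, then cosine decay to 0.
    """
    if epoch < WARMUP_EPOCHS:
        # Linear warmup from 0 up to the base learning rate.
        return BASE_LR * (epoch + 1) / WARMUP_EPOCHS
    # Cosine decay from the base learning rate down to 0.
    progress = (epoch - WARMUP_EPOCHS) / (TOTAL_EPOCHS - WARMUP_EPOCHS)
    return 0.5 * BASE_LR * (1 + math.cos(math.pi * progress))


schedule = [lr_at_epoch(e) for e in range(TOTAL_EPOCHS)]
```

For the Kinetics-400 finetuning described in the same row, one would swap in `BASE_LR = 1e-3` and `TOTAL_EPOCHS = 75` under the same assumed schedule shape.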