ARVideo: Autoregressive Pretraining for Self-Supervised Video Representation Learning
Authors: Sucheng Ren, Hongru Zhu, Chen Wei, Yijiang Li, Alan Yuille, Cihang Xie
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments establish ARVideo as an effective paradigm for self-supervised video representation learning. For example, when trained with the ViT-B backbone, ARVideo competitively attains 81.2% on Kinetics-400 and 70.9% on Something-Something V2, which are on par with the strong benchmark set by VideoMAE. |
| Researcher Affiliation | Academia | Sucheng Ren (Johns Hopkins University); Hongru Zhu (Johns Hopkins University); Chen Wei (Johns Hopkins University); Yijiang Li (Johns Hopkins University); Alan Yuille (Johns Hopkins University); Cihang Xie (UC Santa Cruz) |
| Pseudocode | No | The paper describes the methodology in prose and through figures, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | We primarily evaluate ARVideo on Kinetics-400 (Kay et al., 2017) and Something-Something V2 (Goyal et al., 2017). ... Additionally, we assess the feature transferability on AVA v2.2 (Gu et al., 2018) and HMDB (Kuehne et al., 2011). |
| Dataset Splits | Yes | Specifically, Kinetics-400 contains 400 classes and 260k videos of 10s, with 240k for training and 20k for validation; Something-Something V2 contains 174 classes with 169k videos for training and 25k for validation. ... AVA v2.2 is a human action localization dataset with 211k videos for training and 57k for validation; HMDB is a small video dataset with 3.5k videos for training and 1.5k videos for validation. |
| Hardware Specification | Yes | We report the training time and GPU memory usage in Table 5 (with ViT-B trained on Kinetics-400 for 800 epochs, using 8 A6000 GPUs). |
| Software Dependencies | No | The paper mentions using the AdamW optimizer and ViT-B as the backbone, but does not provide specific version numbers for software libraries like PyTorch, CUDA, or other dependencies. |
| Experiment Setup | Yes | Training Hyperparameters: We employ the AdamW optimizer with a weight decay of 0.05 and a base learning rate of 6e-4. The training schedule comprises a 40-epoch warmup phase followed by a cosine decay learning rate schedule. Finetuning Hyperparameters: ... we employ the AdamW optimizer with a base learning rate of 5e-4 and a weight decay of 0.05. The batch size is set to 512, and we utilize a cosine decay learning rate schedule with 5 warmup epochs over a total of 40 training epochs. Our data augmentation strategies include repeated augmentation (factor of 2) and RandAugment with parameters (9, 0.5), while flip augmentation is disabled. Additionally, we apply label smoothing (0.1), mixup (0.8), and cutmix (1.0). The drop path rate is configured at 0.1, and no dropout is applied for Something-Something V2. The layer-wise learning rate decay factor is set to 0.75. For the Kinetics-400 dataset, most settings remain unchanged except for the following adjustments: the base learning rate is increased to 1e-3, flip augmentation is enabled, and the total number of training epochs is extended to 75. |
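The finetuning schedule quoted above (base learning rate 5e-4, 5 warmup epochs, cosine decay over 40 total epochs) can be sketched in plain Python. This is a minimal reconstruction, not the paper's code: the linear-warmup shape, the zero final learning rate, and the function name `lr_at_epoch` are assumptions.

```python
import math

# Hyperparameters as quoted from the paper's Something-Something V2 finetuning setup.
BASE_LR = 5e-4
WARMUP_EPOCHS = 5
TOTAL_EPOCHS = 40


def lr_at_epoch(epoch: int) -> float:
    """Learning rate at the start of a given 0-indexed epoch.

    Assumed shape: linear warmup to BASE_LR, then cosine decay to 0.
    """
    if epoch < WARMUP_EPOCHS:
        # Linear warmup from 0 up to the base learning rate.
        return BASE_LR * (epoch + 1) / WARMUP_EPOCHS
    # Cosine decay from the base learning rate down to 0.
    progress = (epoch - WARMUP_EPOCHS) / (TOTAL_EPOCHS - WARMUP_EPOCHS)
    return 0.5 * BASE_LR * (1 + math.cos(math.pi * progress))


schedule = [lr_at_epoch(e) for e in range(TOTAL_EPOCHS)]
```

For the Kinetics-400 finetuning described in the same row, one would swap in `BASE_LR = 1e-3` and `TOTAL_EPOCHS = 75` under the same assumed schedule shape.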