Revisiting Feature Prediction for Learning Visual Representations from Video

Authors: Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mido Assran, Nicolas Ballas

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our results show that learning by predicting video features leads to versatile visual representations that perform well on both motion and appearance-based tasks, without adaptation of the model's parameters; e.g., using a frozen backbone. Our largest model, a ViT-H/16 trained only on videos, obtains 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet1K.
Researcher Affiliation | Academia | The paper states 'Anonymous authors. Paper under double-blind review'; therefore no institutional affiliations are provided from which to determine the affiliation type.
Pseudocode | No | The paper describes the methodology in narrative text and uses diagrams (Figure 2, Figure 3) to illustrate the architecture and training process, but it does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statement about releasing source code, nor does it provide a link to a code repository.
Open Datasets | Yes | We combine several public datasets to construct an unsupervised video pretraining dataset, which we refer to as VideoMix2M. Specifically, we combine the videos from HowTo100M (HT) (Miech et al., 2019), Kinetics-400/600/700 (K710) (Kay et al., 2017), and Something-Something-v2 (SSv2) (Goyal et al., 2017)... Pretrained models are evaluated on downstream video and image tasks... action recognition on Kinetics-400 (K400) (Kay et al., 2017), motion classification on Something-Something-v2 (SSv2) (Goyal et al., 2017), and action localization on AVA (Gu et al., 2018). For static image tasks, we explore object recognition on ImageNet (Russakovsky et al., 2015), scene classification on Places205 (Zhou et al., 2014), and fine-grained recognition on iNaturalist 2021 (Van Horn et al., 2018).
Dataset Splits | Yes | We examine the label-efficiency of V-JEPA compared to other self-supervised video models by measuring the ability of the pretrained backbones to adapt to downstream tasks with few labels. Specifically, we investigate the performance of the frozen models on Kinetics-400 and Something-Something-v2 as we vary the percentage of labeled examples from each dataset available for training the attentive probe. We train the probes in several low-shot settings: using either 5% of the train set, 10%, or 50%, and take 3 random splits in each setting to obtain more robust metrics, resulting in 9 different evaluation experiments for each model.
Hardware Specification | Yes | Table 9: pretraining hyper-parameters for V-JEPA. ... hardware: dtype bfloat16; accelerator: A100 80G.
Software Dependencies | No | We use AdamW (Loshchilov & Hutter, 2017) to optimize the x-encoder and predictor weights. The paper names this optimizer but does not specify version numbers for programming languages or other key software libraries.
Experiment Setup | Yes | Table 9: pretraining hyper-parameters for V-JEPA. ... Table 10: Frozen Evaluation hyper-parameters. ... Table 11: Frozen Detection hyper-parameters. ... Table 12: Finetuning Evaluation hyper-parameters. These tables provide specific details, including batch size, learning rates, epochs, weight decay, masking configurations, and other parameters for pretraining and the various evaluation setups.
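The low-shot protocol quoted in the Dataset Splits row (5%, 10%, or 50% of the training labels, with 3 random splits per setting, giving 9 evaluation experiments) can be sketched as follows. This is a minimal illustration of that counting, not the paper's implementation; the function name and the seeded-generator sampling are assumptions.

```python
import random

def low_shot_splits(n_examples, fractions=(0.05, 0.10, 0.50), n_splits=3):
    """Enumerate the low-shot evaluation settings described in the paper:
    each labeled fraction is drawn with n_splits different random seeds."""
    settings = []
    for frac in fractions:
        n_labeled = int(frac * n_examples)
        for seed in range(n_splits):
            rng = random.Random(seed)  # hypothetical seeding scheme
            # Draw a random subset of training indices for this split.
            indices = rng.sample(range(n_examples), n_labeled)
            settings.append({"fraction": frac, "seed": seed, "indices": indices})
    return settings

# 3 fractions x 3 random splits = 9 evaluation experiments per model.
splits = low_shot_splits(1000)
```

Each entry fixes the labeled subset used to train the attentive probe on top of the frozen backbone, so differences across the 3 splits per fraction reflect sampling variance rather than model quality.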