Revisiting Feature Prediction for Learning Visual Representations from Video
Authors: Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mido Assran, Nicolas Ballas
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results show that learning by predicting video features leads to versatile visual representations that perform well on both motion and appearance-based tasks, without adaptation of the model's parameters; e.g., using a frozen backbone. Our largest model, a ViT-H/16 trained only on videos, obtains 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K. |
| Researcher Affiliation | Academia | The paper states 'Anonymous authors Paper under double-blind review'; no institutional affiliations are provided from which to determine the affiliation type. |
| Pseudocode | No | The paper describes the methodology in narrative text and uses diagrams (Figure 2, Figure 3) to illustrate the architecture and training process, but it does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We combine several public datasets to construct an unsupervised video pretraining dataset, which we refer to as VideoMix2M. Specifically, we combine the videos from HowTo100M (HT) (Miech et al., 2019), Kinetics-400/600/700 (K710) (Kay et al., 2017), and Something-Something-v2 (SSv2) (Goyal et al., 2017)... Pretrained models are evaluated on downstream video and image tasks... action recognition on Kinetics-400 (K400) (Kay et al., 2017), motion classification on Something-Something-v2 (SSv2) (Goyal et al., 2017), and action localization on AVA (Gu et al., 2018). For static image tasks, we explore object recognition on ImageNet (Russakovsky et al., 2015), scene classification on Places205 (Zhou et al., 2014), and fine-grained recognition on iNaturalist 2021 (Van Horn et al., 2018). |
| Dataset Splits | Yes | We examine the label-efficiency of V-JEPA compared to other self-supervised video models by measuring the ability of the pretrained backbones to adapt to downstream tasks with few labels. Specifically, we investigate the performance of the frozen models on Kinetics-400 and Something-Something-v2 as we vary the percentage of labeled examples from each dataset available for training the attentive probe. We train the probes in several low-shot settings: using either 5% of the train set, 10%, or 50%, and take 3 random splits in each setting to obtain more robust metrics, resulting in 9 different evaluation experiments for each model. |
| Hardware Specification | Yes | Table 9: pretraining hyper-parameters for V-JEPA. ... hardware: dtype bfloat16, accelerator A100 80G |
| Software Dependencies | No | We use AdamW (Loshchilov & Hutter, 2017) to optimize the x-encoder and predictor weights. The paper mentions this optimizer but does not specify version numbers for programming languages or other key software libraries. |
| Experiment Setup | Yes | Table 9: pretraining hyper-parameters for V-JEPA. ... Table 10: Frozen Evaluation hyper-parameters. ... Table 11: Frozen Detection hyper-parameters. ... Table 12: Finetuning Evaluation hyper-parameters. These tables provide specific details including batch size, learning rates, epochs, weight decay, masking configurations, and other parameters for pretraining and various evaluation setups. |
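The low-shot evaluation protocol quoted under "Dataset Splits" (5%, 10%, or 50% of the train set, with 3 random splits per setting, giving 9 experiments per model) can be sketched as follows. This is a minimal illustration of the split construction only, not the paper's code; the function name and the use of index lists are our own assumptions.

```python
import random

def low_shot_splits(train_indices, fractions=(0.05, 0.10, 0.50), n_seeds=3):
    """Build the low-shot evaluation subsets described in the paper:
    for each labeled fraction, draw n_seeds independent random samples
    of the train set, yielding len(fractions) * n_seeds subsets.

    Note: function name and structure are hypothetical, for illustration.
    """
    splits = {}
    for frac in fractions:
        # Number of labeled examples available at this fraction.
        k = max(1, int(len(train_indices) * frac))
        for seed in range(n_seeds):
            rng = random.Random(seed)  # fixed seed per split for reproducibility
            splits[(frac, seed)] = rng.sample(train_indices, k)
    return splits

# 3 fractions x 3 seeds = 9 evaluation subsets per dataset, as in the paper.
subsets = low_shot_splits(list(range(1000)))
print(len(subsets))
```

Each subset would then be used to train the attentive probe on top of the frozen backbone, and results averaged over the 3 seeds per fraction.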