PolyViT: Co-training Vision Transformers on Images, Videos and Audio

Authors: Valerii Likhosherstov, Anurag Arnab, Krzysztof Marcin Choromanski, Mario Lucic, Yi Tay, Mostafa Dehghani

TMLR 2023

Reproducibility Variable Result LLM Response
Research Type Experimental By co-training on different tasks of a single modality, we are able to achieve significant accuracy improvements on 5 standard video- and audio-classification datasets. Furthermore, co-training PolyViT on multiple modalities and tasks leads to a parameter-efficient model which generalizes across multiple domains. In particular, our multi-modal PolyViT trained on 9 datasets across 3 modalities uses 8.3 times fewer parameters and outperforms a state-of-the-art single-task baseline on 2 of these datasets, whilst achieving competitive performance on the others. Finally, this simple and practical approach necessitates less hyperparameter tuning as the per-task hyperparameters can be readily reused. To facilitate further research, we have released code at https://github.com/google-research/scenic.
Researcher Affiliation Collaboration Valerii Likhosherstov (EMAIL) University of Cambridge; Anurag Arnab (EMAIL) Google Research; Krzysztof Choromanski (EMAIL) Google Research; Mario Lucic (EMAIL) Google Research; Yi Tay (EMAIL) Google Research; Adrian Weller (EMAIL) University of Cambridge & The Alan Turing Institute; Mostafa Dehghani (EMAIL) Google Research
Pseudocode No The paper describes the model architecture and procedures using mathematical equations (e.g., Eq. 1, 3, 4, 5) and textual descriptions, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured, step-by-step code-like formatting.
Open Source Code Yes To facilitate further research, we have released code at https://github.com/google-research/scenic.
Open Datasets Yes For image classification, we use ImageNet-1K, CIFAR-10 and -100, Oxford-IIIT Pets, and RESISC45. For video, we use Kinetics 400 and Moments in Time, and for audio, AudioSet and VGGSound. Exhaustive details of these datasets are in Appendix A.
Dataset Splits Yes We take 2% of the CIFAR-10/100 train sets for validation, 10% of the Pets train set for validation and 1% of the ImageNet-1K train set for validation. We use standard test sets for these datasets. For RESISC45, we use 20% of the train set for validation and 20% for testing. We use standard train, validation and test sets for video and audio tasks.
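As a minimal sketch of the held-out fractions quoted above, the validation-set sizes implied by the paper's percentages can be computed directly (the helper name is illustrative and not part of the released Scenic code):

```python
def val_split_size(train_size: int, val_fraction: float) -> int:
    """Number of training examples held out for validation."""
    return int(round(train_size * val_fraction))

# Fractions quoted in the paper, applied to the standard train-set sizes.
cifar_val = val_split_size(50_000, 0.02)      # CIFAR-10/100: 2% of 50,000 = 1,000
imagenet_val = val_split_size(1_281_167, 0.01)  # ImageNet-1K: 1% of the train set
print(cifar_val, imagenet_val)
```

This only illustrates the arithmetic; how the examples are actually selected (e.g., a fixed seed or a deterministic slice) is not specified in the quoted text.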
Hardware Specification No Constructing minibatches from a single task (where each example has the same number of tokens) has further computational advantages on GPU- or TPU-accelerators, as tokens do not need to be padded to a maximum sequence length.
Software Dependencies No PolyViT is implemented in Scenic (Dehghani et al., 2021b) and the code for training and evaluation of the model is available at https://github.com/google-research/scenic.
Experiment Setup Yes We set the training hyperparameters for these tasks (and those of the single-task baselines) using the values reported by Dosovitskiy et al. (2021) for image tasks, Arnab et al. (2021) for video tasks and Nagrani et al. (2021) for audio tasks (detailed in Appendix A). Note that the audio-only model of Nagrani et al. (2021), which we use as our baseline, is identical to AST (Gong et al., 2021), and we choose it since the authors have evaluated on more datasets. We perform experiments with two standard transformer encoder configurations: Base (number of layers L = 12, hidden dimension d = 768, attention heads h = 12) and Large (L = 24, d = 1024, h = 16), following Devlin et al. (2019); Dosovitskiy et al. (2021).
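The two encoder configurations quoted above can be captured in a small config sketch. This is illustrative only (the paper's actual implementation lives in Scenic); the class and field names are assumptions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EncoderConfig:
    """Transformer encoder hyperparameters as quoted in the paper."""
    num_layers: int  # L
    hidden_dim: int  # d
    num_heads: int   # h

    @property
    def head_dim(self) -> int:
        # Per-head dimension; d must divide evenly among the h heads.
        assert self.hidden_dim % self.num_heads == 0
        return self.hidden_dim // self.num_heads

# Base and Large follow Devlin et al. (2019) / Dosovitskiy et al. (2021).
BASE = EncoderConfig(num_layers=12, hidden_dim=768, num_heads=12)
LARGE = EncoderConfig(num_layers=24, hidden_dim=1024, num_heads=16)
```

Note that both configurations yield a per-head dimension of 64 (768/12 and 1024/16), the standard choice in these transformer families.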