PolyViT: Co-training Vision Transformers on Images, Videos and Audio

Authors: Valerii Likhosherstov, Anurag Arnab, Krzysztof Marcin Choromanski, Mario Lucic, Yi Tay, Mostafa Dehghani

TMLR 2023

Reproducibility Variable Result LLM Response
Research Type Experimental By co-training on different tasks of a single modality, we are able to achieve significant accuracy improvements on 5 standard video- and audio-classification datasets. Furthermore, co-training PolyViT on multiple modalities and tasks leads to a parameter-efficient model which generalizes across multiple domains. In particular, our multi-modal PolyViT trained on 9 datasets across 3 modalities uses 8.3 times fewer parameters and outperforms a state-of-the-art single-task baseline on 2 of these datasets, whilst achieving competitive performance on the others. Finally, this simple and practical approach necessitates less hyperparameter tuning as the per-task hyperparameters can be readily reused. To facilitate further research, we have released code at https://github.com/google-research/scenic.
Researcher Affiliation Collaboration Valerii Likhosherstov (EMAIL) University of Cambridge; Anurag Arnab (EMAIL) Google Research; Krzysztof Choromanski (EMAIL) Google Research; Mario Lucic (EMAIL) Google Research; Yi Tay (EMAIL) Google Research; Adrian Weller (EMAIL) University of Cambridge & The Alan Turing Institute; Mostafa Dehghani (EMAIL) Google Research
Pseudocode No The paper describes the model architecture and procedures using mathematical equations (e.g., Eq. 1, 3, 4, 5) and textual descriptions, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured, step-by-step code-like formatting.
Open Source Code Yes To facilitate further research, we have released code at https://github.com/google-research/scenic.
Open Datasets Yes For image classification, we use ImageNet-1K, CIFAR-10 and -100, Oxford-IIIT Pets, and RESISC45. For video, we use Kinetics 400 and Moments in Time, and for audio, AudioSet and VGGSound. Exhaustive details of these datasets are in Appendix A.
Dataset Splits Yes We take 2% of the CIFAR-10/100 train sets for validation, 10% of the Pets train set for validation and 1% of the ImageNet-1K train set for validation. We use standard test sets for these datasets. For RESISC45, we use 20% of the train set for validation and 20% for testing. We use standard train, validation and test sets for video and audio tasks.
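As a minimal sketch of the held-out fractions quoted above, the validation-set sizes implied by the paper's percentages can be computed directly (the helper name is illustrative and not part of the released Scenic code):

```python
def val_split_size(train_size: int, val_fraction: float) -> int:
    """Number of training examples held out for validation."""
    return int(round(train_size * val_fraction))

# Fractions quoted in the paper, applied to the standard train-set sizes.
cifar_val = val_split_size(50_000, 0.02)      # CIFAR-10/100: 2% of 50,000 = 1,000
imagenet_val = val_split_size(1_281_167, 0.01)  # ImageNet-1K: 1% of the train set
print(cifar_val, imagenet_val)
```

This only illustrates the arithmetic; how the examples are actually selected (e.g., a fixed seed or a deterministic slice) is not specified in the quoted text.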
Hardware Specification No Constructing minibatches from a single task (where each example has the same number of tokens) has further computational advantages on GPU- or TPU-accelerators, as tokens do not need to be padded to a maximum sequence length.
Software Dependencies No PolyViT is implemented in Scenic (Dehghani et al., 2021b) and the code for training and evaluation of the model is available at https://github.com/google-research/scenic.
Experiment Setup Yes We set the training hyperparameters for these tasks (and those of the single-task baselines) using the values reported by Dosovitskiy et al. (2021) for image tasks, Arnab et al. (2021) for video tasks and Nagrani et al. (2021) for audio tasks (detailed in Appendix A). Note that the audio-only model of Nagrani et al. (2021), which we use as our baseline, is identical to AST (Gong et al., 2021), and we choose it since the authors have evaluated on more datasets. We perform experiments with two standard transformer encoder configurations: Base (number of layers L = 12, hidden dimension d = 768, attention heads h = 12) and Large (L = 24, d = 1024, h = 16), following Devlin et al. (2019); Dosovitskiy et al. (2021).
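The two encoder configurations quoted above can be captured in a small config sketch. This is illustrative only (the paper's actual implementation lives in Scenic); the class and field names are assumptions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EncoderConfig:
    """Transformer encoder hyperparameters as quoted in the paper."""
    num_layers: int  # L
    hidden_dim: int  # d
    num_heads: int   # h

    @property
    def head_dim(self) -> int:
        # Per-head dimension; d must divide evenly among the h heads.
        assert self.hidden_dim % self.num_heads == 0
        return self.hidden_dim // self.num_heads

# Base and Large follow Devlin et al. (2019) / Dosovitskiy et al. (2021).
BASE = EncoderConfig(num_layers=12, hidden_dim=768, num_heads=12)
LARGE = EncoderConfig(num_layers=24, hidden_dim=1024, num_heads=16)
```

Note that both configurations yield a per-head dimension of 64 (768/12 and 1024/16), the standard choice in these transformer families.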