Prismer: A Vision-Language Model with Multi-Task Experts

Authors: Shikun Liu, Linxi Fan, Edward Johns, Zhiding Yu, Chaowei Xiao, Anima Anandkumar

TMLR 2024

Reproducibility assessment: each entry below lists the variable, the result, and the supporting LLM response.
Research Type: Experimental
"In our experiments, we show that Prismer achieves fine-tuned and few-shot learning performance which is competitive with current state-of-the-arts, whilst requiring up to two orders of magnitude less training data. Code is available at https://github.com/NVlabs/prismer."
Researcher Affiliation: Collaboration
Shikun Liu (1,2), Linxi Fan (2), Edward Johns (1), Zhiding Yu (2), Chaowei Xiao (2,3), Anima Anandkumar (2,4). Affiliations: 1 Imperial College London, 2 NVIDIA, 3 University of Wisconsin-Madison, 4 Caltech.
Pseudocode: No
The paper describes the model architecture and training process in text and diagrams (Figures 1, 2, 3), but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code: Yes
"Code is available at https://github.com/NVlabs/prismer."
Open Datasets: Yes
"We construct our pre-training data from the following datasets: two in-domain datasets: COCO (Lin et al., 2014) and Visual Genome (Krishna et al., 2017); and three web datasets: Conceptual Captions (Sharma et al., 2018), SBU captions (Ordonez et al., 2011), and a much noisier Conceptual 12M (Changpinyo et al., 2021)." The web datasets are pre-filtered and re-captioned by a pre-trained image captioner (Li et al., 2022). The pre-training data comprises 11M unique images, or 12.7M image/alt-text pairs. All datasets are publicly available and have been widely used for pre-training many VLMs (Li et al., 2021; 2022; Chen et al., 2020).
Dataset Splits: Yes
"We fine-tune our models on COCO Caption dataset (Chen et al., 2015) on a widely adopted Karpathy split (Karpathy & Fei-Fei, 2015), with the standard cross-entropy loss, and without metric-specific optimisation (Vedantam et al., 2015)." The fine-tuned models are evaluated on the COCO Caption Karpathy test split and the NoCaps (Agrawal et al., 2019) validation set. The models are also evaluated on the VQAv2 dataset (Antol et al., 2015), with additional training samples from Visual Genome (Krishna et al., 2017) following Li et al. (2022).
Hardware Specification: Yes
"The largest model variant, Prismer LARGE, only requires 8 days of training on 32 NVIDIA V100 GPUs."
Software Dependencies: No
The paper mentions using the "AdamW optimiser", "Automatic Mixed Precision (AMP) with fp16 precision", and the "ZeRO Stage 2 technique", but does not specify version numbers for these software components or for any other libraries or frameworks.
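As a sketch of how the missing version information could be recorded, the snippet below queries installed package versions at training time. The package names (`torch`, `deepspeed`) are assumptions inferred from the techniques the paper names (AMP, ZeRO Stage 2); the paper itself does not confirm which libraries were used.

```python
# Sketch: record exact software versions for a reproducibility report.
# Package names are assumptions based on the techniques the paper mentions
# (AMP with fp16, ZeRO Stage 2), not confirmed by the paper.
import sys
from importlib import metadata

def environment_report(packages=("torch", "deepspeed")):
    """Return a dict mapping the Python interpreter and each package
    to its installed version, or 'not installed' if absent."""
    report = {"python": sys.version.split()[0]}
    for pkg in packages:
        try:
            report[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            report[pkg] = "not installed"
    return report

print(environment_report())
```

Emitting such a report alongside published code would close the gap this assessment identifies.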
Experiment Setup: Yes
"Table 6: The detailed list of hyper-parameters and training strategy. To ensure reproducibility, we have included a list of all hyper-parameters used in our experiments. These same hyper-parameters are applied to both the BASE and LARGE model variants." The table then lists specific values for the optimiser, LR schedule, weight decay, warmup steps, initial LR, resolution, epochs, and batch size for pre-training and the various fine-tuning tasks.
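To illustrate the shape of the reported configuration, the sketch below mirrors the fields Table 6 covers. The keys come from the fields listed above; the `None` values are placeholders, since the actual settings live in the paper's Table 6 and are not reproduced here.

```python
# Hypothetical config skeleton mirroring Table 6's hyper-parameter fields.
# Values are placeholders (None), not the paper's actual settings; only the
# optimiser name (AdamW) is stated in the paper.
PRETRAIN_CONFIG = {
    "optimiser": "AdamW",   # named in the paper
    "lr_schedule": None,    # placeholder: see Table 6
    "weight_decay": None,   # placeholder: see Table 6
    "warmup_steps": None,   # placeholder: see Table 6
    "initial_lr": None,     # placeholder: see Table 6
    "resolution": None,     # placeholder: see Table 6
    "epochs": None,         # placeholder: see Table 6
    "batch_size": None,     # placeholder: see Table 6
}

def missing_fields(config):
    """Return the fields still needing values from Table 6."""
    return sorted(k for k, v in config.items() if v is None)
```

A separate dict of the same shape would hold each fine-tuning task's settings, since the paper reports them per task.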