Prismer: A Vision-Language Model with Multi-Task Experts
Authors: Shikun Liu, Linxi Fan, Edward Johns, Zhiding Yu, Chaowei Xiao, Anima Anandkumar
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, we show that Prismer achieves fine-tuned and few-shot learning performance which is competitive with current state-of-the-arts, whilst requiring up to two orders of magnitude less training data. Code is available at https://github.com/NVlabs/prismer. |
| Researcher Affiliation | Collaboration | Shikun Liu (1,2), Linxi Fan (2), Edward Johns (1), Zhiding Yu (2), Chaowei Xiao (2,3), Anima Anandkumar (2,4). Affiliations: (1) Imperial College London, (2) NVIDIA, (3) University of Wisconsin, Madison, (4) Caltech. |
| Pseudocode | No | The paper describes the model architecture and training process in text and diagrams (Figures 1, 2, 3), but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/NVlabs/prismer. |
| Open Datasets | Yes | We construct our pre-training data from the following datasets: two in-domain datasets: COCO (Lin et al., 2014) and Visual Genome (Krishna et al., 2017); and three web datasets: Conceptual Captions (Sharma et al., 2018), SBU captions (Ordonez et al., 2011), and a much noisier Conceptual 12M (Changpinyo et al., 2021). The web datasets are pre-filtered and re-captioned by a pretrained image captioner (Li et al., 2022). The pre-training datasets include 11M unique images or 12.7M image/alt-text pairs. All datasets are available publicly and have been widely used for pre-training many VLMs (Li et al., 2021; 2022; Chen et al., 2020). |
| Dataset Splits | Yes | We fine-tune our models on COCO Caption dataset (Chen et al., 2015) on a widely adopted Karpathy split (Karpathy & Fei-Fei, 2015), with the standard cross-entropy loss, and without metric-specific optimisation (Vedantam et al., 2015). We evaluate the fine-tuned models on the COCO Caption Karpathy test split and NoCaps (Agrawal et al., 2019) validation set. We also evaluate our models on the VQAv2 dataset (Antol et al., 2015), with additional training samples from Visual Genome (Krishna et al., 2017) following Li et al. (2022). |
| Hardware Specification | Yes | The largest model variant, Prismer LARGE, only requires 8 days of training on 32 NVIDIA V100 GPUs. |
| Software Dependencies | No | The paper mentions using the 'AdamW optimiser', 'Automatic Mixed Precision (AMP) with fp16 precision', and the 'ZeRO Stage 2 technique', but does not specify version numbers for these software components or for any other libraries or frameworks. |
| Experiment Setup | Yes | Table 6: The detailed list of hyper-parameters and training strategy. To ensure reproducibility, we have included a list of all hyper-parameters used in our experiments. These same hyper-parameters are applied to both the BASE and LARGE model variants. The table then lists specific values for Optimiser, LR Schedule, Weight Decay, Warmup Steps, Initial LR, Resolution, Epochs, and Batch Size for pre-training and various fine-tuning tasks. |
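For a reproduction attempt, the hyper-parameter fields named in the Table 6 description above can be collected into a single configuration object so that no field is silently missing when a run is launched. The sketch below is illustrative only: the field names come from the quoted description, "AdamW" is named in the paper, and every numeric value is a hypothetical placeholder (the actual values are in the paper's Table 6 and the NVlabs/prismer repository).

```python
from dataclasses import dataclass, asdict


@dataclass
class PrismerRunConfig:
    """Hyper-parameter fields listed in Table 6; all values below are
    placeholders, not the paper's actual settings."""
    optimiser: str = "AdamW"      # optimiser name stated in the paper
    lr_schedule: str = "cosine"   # placeholder: schedule type not quoted above
    weight_decay: float = 0.05    # placeholder value
    warmup_steps: int = 1000      # placeholder value
    initial_lr: float = 1e-4      # placeholder value
    resolution: int = 224         # placeholder value
    epochs: int = 10              # placeholder value
    batch_size: int = 256         # placeholder value


pretrain_cfg = PrismerRunConfig()

# Sanity-check that every field named in Table 6 is present in the config.
expected_fields = {"optimiser", "lr_schedule", "weight_decay", "warmup_steps",
                   "initial_lr", "resolution", "epochs", "batch_size"}
assert expected_fields == set(asdict(pretrain_cfg))
```

The same dataclass can be instantiated with per-task overrides for pre-training and each fine-tuning task, mirroring the per-task columns that Table 6 reports.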
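The Karpathy split noted under Dataset Splits is conventionally distributed as a single JSON annotation file in which each image record carries a `split` field (`train`, `val`, `test`, or `restval`, with `restval` usually folded into training). A minimal filtering helper, assuming that widely circulated layout rather than anything Prismer-specific, might look like:

```python
def karpathy_subset(images, split):
    """Return the image records belonging to one Karpathy split.

    `images` is a list of dicts, each with a 'split' key, following the
    layout of the commonly used Karpathy annotation file; 'restval' is
    conventionally merged into the training split.
    """
    wanted = {"train", "restval"} if split == "train" else {split}
    return [img for img in images if img.get("split") in wanted]


# Tiny synthetic example in the same shape as the real annotation file.
records = [
    {"filename": "a.jpg", "split": "train"},
    {"filename": "b.jpg", "split": "restval"},
    {"filename": "c.jpg", "split": "test"},
]
assert len(karpathy_subset(records, "train")) == 2
assert [r["filename"] for r in karpathy_subset(records, "test")] == ["c.jpg"]
```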