Prismer: A Vision-Language Model with Multi-Task Experts

Authors: Shikun Liu, Linxi Fan, Edward Johns, Zhiding Yu, Chaowei Xiao, Anima Anandkumar

TMLR 2024

Reproducibility assessment: each entry below lists the variable, the result, and the supporting LLM response.
Research Type: Experimental
"In our experiments, we show that Prismer achieves fine-tuned and few-shot learning performance which is competitive with current state-of-the-arts, whilst requiring up to two orders of magnitude less training data. Code is available at https://github.com/NVlabs/prismer."
Researcher Affiliation: Collaboration
Shikun Liu (1,2), Linxi Fan (2), Edward Johns (1), Zhiding Yu (2), Chaowei Xiao (2,3), Anima Anandkumar (2,4). Affiliations: 1 Imperial College London, 2 NVIDIA, 3 University of Wisconsin-Madison, 4 Caltech.
Pseudocode: No
The paper describes the model architecture and training process in text and diagrams (Figures 1, 2, 3), but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code: Yes
"Code is available at https://github.com/NVlabs/prismer."
Open Datasets: Yes
"We construct our pre-training data from the following datasets: two in-domain datasets: COCO (Lin et al., 2014) and Visual Genome (Krishna et al., 2017); and three web datasets: Conceptual Captions (Sharma et al., 2018), SBU captions (Ordonez et al., 2011), and a much noisier Conceptual 12M (Changpinyo et al., 2021)." The web datasets are pre-filtered and re-captioned by a pre-trained image captioner (Li et al., 2022). The pre-training data comprises 11M unique images, or 12.7M image/alt-text pairs. All datasets are publicly available and have been widely used for pre-training many VLMs (Li et al., 2021; 2022; Chen et al., 2020).
Dataset Splits: Yes
"We fine-tune our models on COCO Caption dataset (Chen et al., 2015) on a widely adopted Karpathy split (Karpathy & Fei-Fei, 2015), with the standard cross-entropy loss, and without metric-specific optimisation (Vedantam et al., 2015)." The fine-tuned models are evaluated on the COCO Caption Karpathy test split and the NoCaps (Agrawal et al., 2019) validation set. The models are also evaluated on the VQAv2 dataset (Antol et al., 2015), with additional training samples from Visual Genome (Krishna et al., 2017) following Li et al. (2022).
Hardware Specification: Yes
"The largest model variant, Prismer LARGE, only requires 8 days of training on 32 NVIDIA V100 GPUs."
Software Dependencies: No
The paper mentions using the "AdamW optimiser", "Automatic Mixed Precision (AMP) with fp16 precision", and the "ZeRO Stage 2 technique", but does not specify version numbers for these software components or for any other libraries or frameworks.
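As a sketch of how the missing version information could be recorded, the snippet below queries installed package versions at training time. The package names (`torch`, `deepspeed`) are assumptions inferred from the techniques the paper names (AMP, ZeRO Stage 2); the paper itself does not confirm which libraries were used.

```python
# Sketch: record exact software versions for a reproducibility report.
# Package names are assumptions based on the techniques the paper mentions
# (AMP with fp16, ZeRO Stage 2), not confirmed by the paper.
import sys
from importlib import metadata

def environment_report(packages=("torch", "deepspeed")):
    """Return a dict mapping the Python interpreter and each package
    to its installed version, or 'not installed' if absent."""
    report = {"python": sys.version.split()[0]}
    for pkg in packages:
        try:
            report[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            report[pkg] = "not installed"
    return report

print(environment_report())
```

Emitting such a report alongside published code would close the gap this assessment identifies.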
Experiment Setup: Yes
"Table 6: The detailed list of hyper-parameters and training strategy. To ensure reproducibility, we have included a list of all hyper-parameters used in our experiments. These same hyper-parameters are applied to both the BASE and LARGE model variants." The table then lists specific values for the optimiser, LR schedule, weight decay, warmup steps, initial LR, resolution, epochs, and batch size for pre-training and the various fine-tuning tasks.
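To illustrate the shape of the reported configuration, the sketch below mirrors the fields Table 6 covers. The keys come from the fields listed above; the `None` values are placeholders, since the actual settings live in the paper's Table 6 and are not reproduced here.

```python
# Hypothetical config skeleton mirroring Table 6's hyper-parameter fields.
# Values are placeholders (None), not the paper's actual settings; only the
# optimiser name (AdamW) is stated in the paper.
PRETRAIN_CONFIG = {
    "optimiser": "AdamW",   # named in the paper
    "lr_schedule": None,    # placeholder: see Table 6
    "weight_decay": None,   # placeholder: see Table 6
    "warmup_steps": None,   # placeholder: see Table 6
    "initial_lr": None,     # placeholder: see Table 6
    "resolution": None,     # placeholder: see Table 6
    "epochs": None,         # placeholder: see Table 6
    "batch_size": None,     # placeholder: see Table 6
}

def missing_fields(config):
    """Return the fields still needing values from Table 6."""
    return sorted(k for k, v in config.items() if v is None)
```

A separate dict of the same shape would hold each fine-tuning task's settings, since the paper reports them per task.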