Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Multimodal Few-Shot Learning with Frozen Language Models
Authors: Maria Tsimpoukelli, Jacob L Menick, Serkan Cabi, S. M. Ali Eslami, Oriol Vinyals, Felix Hill
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments are designed to quantify three capacities that should be characteristic of a multimodal few-shot learner: rapid adaptation to new tasks, fast access to general knowledge, and fast binding of visual and linguistic elements. We quantify these capabilities on a range of existing and new benchmarks, paving the way for future analysis of these capabilities. |
| Researcher Affiliation | Collaboration | Maria Tsimpoukelli (DeepMind), Jacob Menick (DeepMind and University College London), Serkan Cabi (DeepMind), S. M. Ali Eslami (DeepMind), Oriol Vinyals (DeepMind), Felix Hill (DeepMind) |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The Open-Ended miniImageNet, Real-Name miniImageNet, Fast-VQA and Guided-VQA evaluation sets are available to download at https://fh295.github.io/frozen.html. This link provides evaluation datasets, not the source code for the method. |
| Open Datasets | Yes | We use a 7 billion parameter transformer trained on the public dataset C4 [31]; previous work has shown that the multi-billion parameter scale is sufficient to exhibit the key capacities we are interested in studying [30, 34]. During training, we update only the parameters φ of the vision encoder, using paired image-caption data from the Conceptual Captions dataset [37]. (A sketch of this frozen-LM training setup appears after the table.) |
| Dataset Splits | Yes | We do early stopping on the validation set perplexity, which usually reaches an optimum just after a single epoch with batch size 128. We evaluate on the VQAv2 [10] validation set. |
| Hardware Specification | No | The paper does not mention the specific hardware (such as GPU models, CPU types, or TPU versions) used to run the experiments. |
| Software Dependencies | No | The paper mentions software components such as the SentencePiece tokenizer and the Adam optimizer, but does not provide version numbers for these or other software dependencies. |
| Experiment Setup | Yes | All experiments used the Adam optimizer with β1 = 0.9 and β2 = 0.95 and a constant learning rate of 3e-4 unless otherwise noted. We do early stopping on the validation set perplexity, which usually reaches an optimum just after a single epoch with batch size 128. We experimented with different numbers of prefix tokens k (specifically 1, 2, and 4) and found that 2 performs best, though this would certainly be sensitive to other architectural details. We operate on 224 × 224 images at both train and test time. Images which are not square are first zero-padded to square and then resized to 224 × 224. (A preprocessing sketch appears after the table.) |
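The training setup quoted in the Open Datasets row (a frozen pretrained language model with gradients flowing only into a vision encoder that emits a prefix of k token embeddings) can be illustrated with a minimal PyTorch sketch. This is not the authors' code: the names `VisionPrefixEncoder` and `frozen_training_step` are ours, the toy convnet stands in for the paper's NF-ResNet backbone, and the interfaces assumed for `lm`, `embed`, and `lm_head` (embeddings in, hidden states out; a separate vocabulary projection) are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisionPrefixEncoder(nn.Module):
    """Toy stand-in for Frozen's vision encoder: maps an image to k
    continuous embeddings (a "visual prefix") in the LM's input space.
    The paper found k = 2 works best; it uses an NF-ResNet backbone,
    not this tiny convnet."""
    def __init__(self, d_model: int, k: int = 2):
        super().__init__()
        self.k, self.d_model = k, d_model
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=4),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.to_prefix = nn.Linear(64, k * d_model)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(images)                      # (B, 64)
        return self.to_prefix(feats).view(-1, self.k, self.d_model)

def frozen_training_step(vision_encoder, lm, embed, lm_head,
                         images, caption_ids):
    """One Frozen-style step: gradients flow *through* the frozen LM
    back into the vision encoder, whose parameters φ are the only
    ones being updated."""
    k = vision_encoder.k
    prefix = vision_encoder(images)                        # (B, k, d)
    text_emb = embed(caption_ids)                          # frozen lookup
    hidden = lm(torch.cat([prefix, text_emb], dim=1))      # frozen transformer
    logits = lm_head(hidden)                               # (B, k + T, vocab)
    # The logit at position k - 1 predicts caption token 0, and so on.
    pred = logits[:, k - 1 : -1].reshape(-1, logits.size(-1))
    return F.cross_entropy(pred, caption_ids.reshape(-1))

# Only the vision encoder is optimized; everything else stays frozen:
#   for p in (*lm.parameters(), *embed.parameters(), *lm_head.parameters()):
#       p.requires_grad_(False)
#   opt = torch.optim.Adam(vision_encoder.parameters(), lr=3e-4,
#                          betas=(0.9, 0.95))
```

The commented optimizer lines mirror the hyperparameters reported in the Experiment Setup row (Adam with β1 = 0.9, β2 = 0.95, learning rate 3e-4).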
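The zero-pad-then-resize preprocessing from the Experiment Setup row is easy to mis-implement (padding after resizing changes the aspect ratio), so here is a minimal sketch. The function name is ours, and padding on the right/bottom is an assumption; the paper only says images are zero-padded to square and resized to 224 × 224.

```python
import torch
import torch.nn.functional as F

def pad_to_square_and_resize(image: torch.Tensor, size: int = 224) -> torch.Tensor:
    """Zero-pad a (C, H, W) image to a square, then resize it to
    (C, size, size), following the preprocessing the paper describes."""
    _, h, w = image.shape
    side = max(h, w)
    # F.pad pads the last dims first: (W_left, W_right, H_top, H_bottom).
    padded = F.pad(image, (0, side - w, 0, side - h), value=0.0)
    # Bilinear resize expects a leading batch dimension.
    resized = F.interpolate(padded.unsqueeze(0), size=(size, size),
                            mode="bilinear", align_corners=False)
    return resized.squeeze(0)

# Example: a 3 x 180 x 240 image becomes 3 x 224 x 224.
img = torch.rand(3, 180, 240)
assert pad_to_square_and_resize(img).shape == (3, 224, 224)
```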