MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks

Authors: Weicheng Kuo, AJ Piergiovanni, Dahun Kim, Xiyang Luo, Benjamin Caine, Wei Li, Abhijit Ogale, Luowei Zhou, Andrew M. Dai, Zhifeng Chen, Claire Cui, Anelia Angelova

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our model achieves the state of the art on image-text and text-image retrieval, video question answering, and open-vocabulary detection tasks, outperforming much larger and more extensively trained foundational models. It shows very competitive results on VQA and video captioning, especially considering its capacity. Ablations confirm the flexibility and advantages of our approach. (Sections 5 "Experiments" and 6 "Ablation studies")
Researcher Affiliation | Industry | Weicheng Kuo, AJ Piergiovanni, Dahun Kim, Xiyang Luo, Ben Caine, Wei Li, Abhijit Ogale, Luowei Zhou, Andrew Dai, Zhifeng Chen, Claire Cui, Anelia Angelova — Google Research. Correspondence to EMAIL.
Pseudocode | No | The paper describes the methodology narratively and mathematically (e.g., Equation 5) but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain an unambiguous statement of code release or a link to a code repository. The OpenReview link provided is for the review process, not for source code.
Open Datasets | Yes | We evaluate the performance of MaMMUT on zero-shot image-text retrieval tasks. Table 1 shows the image-to-text and text-to-image results, compared to the SOTA methods on two popular retrieval benchmarks, MS COCO (Chen et al., 2015) and Flickr (Plummer et al., 2015). We report the performance on the VQAv2 benchmark (Agrawal et al., 2015) in Table 2. The results are presented in Table 4a, comparing to the state of the art on the MSRVTT-QA (Jun Xu & Rui, 2016) and MSVD-QA (Xu et al., 2017) datasets. Video captioning results on the MSRVTT (Jun Xu & Rui, 2016) and MSVD (Chen & Dolan, 2011; Xu et al., 2017) datasets are presented in Table 4b. We evaluate our work on the challenging LVIS dataset (Gupta et al., 2019).
Dataset Splits | Yes | MS COCO (5K test set) and Flickr30K (1K test set), with image-to-text and text-to-image columns per model [...] We report the performance on the VQAv2 benchmark (Agrawal et al., 2015) in Table 2, with Test-Dev and Test-Std columns [...] The results are presented in Table 4a, comparing to the state of the art on the MSRVTT-QA (Jun Xu & Rui, 2016) and MSVD-QA (Xu et al., 2017) datasets. Video captioning results on the MSRVTT (Jun Xu & Rui, 2016) and MSVD (Chen & Dolan, 2011; Xu et al., 2017) datasets are presented in Table 4b. The model is trained with base categories and tested to detect novel-category objects at inference.
Hardware Specification | No | The paper describes the model architecture and training configurations but does not provide specific details about the hardware (e.g., GPU models, CPU types, or TPU versions) used for running the experiments.
Software Dependencies | No | The paper mentions using the AdamW optimizer and the SentencePiece tokenizer but does not specify their version numbers or other software dependencies with specific versions.
Experiment Setup | Yes | The large model is trained for 500K steps using a batch size of 16K. We use the AdamW optimizer with weight decay value 0.01. Our initial learning rate is 0.001, and both generative and contrastive loss weights are set to 1.0. We first resize every image to 272x272 and randomly crop a 224x224 patch out for pretraining. We apply 10K warmup steps before applying linear LR decay to the end of training. The temperature in contrastive learning is learnable and initialized to 1.0.
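The reported schedule (10K linear warmup steps, then linear LR decay over a 500K-step run, peak learning rate 0.001) can be sketched as a small helper. This is a minimal illustration, not the authors' code: the function name is invented, and decaying linearly to exactly zero at the final step is an assumption, since the paper only says "linear LR decay to the end of training".

```python
def mammut_lr(step: int,
              peak_lr: float = 1e-3,      # initial LR from the paper
              warmup_steps: int = 10_000,  # 10K warmup steps
              total_steps: int = 500_000   # 500K total training steps
              ) -> float:
    """Linear warmup followed by linear decay.

    Decaying to exactly 0.0 at `total_steps` is an assumption made
    for illustration; the paper does not state the final LR value.
    """
    if step < warmup_steps:
        # Ramp linearly from 0 up to peak_lr over the warmup phase.
        return peak_lr * step / warmup_steps
    # Decay linearly from peak_lr back toward 0 by the end of training.
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)
```

For example, `mammut_lr(5_000)` returns 0.0005 (halfway through warmup) and `mammut_lr(10_000)` returns the peak value 0.001.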