MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks

Authors: Weicheng Kuo, AJ Piergiovanni, Dahun Kim, Xiyang Luo, Benjamin Caine, Wei Li, Abhijit Ogale, Luowei Zhou, Andrew M. Dai, Zhifeng Chen, Claire Cui, Anelia Angelova

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our model achieves the state of the art on image-text and text-image retrieval, video question answering, and open-vocabulary detection tasks, outperforming much larger and more extensively trained foundational models. It shows very competitive results on VQA and video captioning, especially considering its capacity. Ablations confirm the flexibility and advantages of our approach. (Sections 5 "Experiments" and 6 "Ablation studies")
Researcher Affiliation | Industry | Weicheng Kuo, AJ Piergiovanni, Dahun Kim, Xiyang Luo, Ben Caine, Wei Li, Abhijit Ogale, Luowei Zhou, Andrew Dai, Zhifeng Chen, Claire Cui, Anelia Angelova — Google Research. Correspondence to EMAIL.
Pseudocode | No | The paper describes the methodology narratively and mathematically (e.g., Equation 5) but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain an unambiguous statement of code release or a link to a code repository. The OpenReview link provided is for the review process, not for source code.
Open Datasets | Yes | We evaluate the performance of MaMMUT on zero-shot image-text retrieval tasks. Table 1 shows the image-to-text and text-to-image results, compared to the SOTA methods on two popular retrieval benchmarks, MS COCO (Chen et al., 2015) and Flickr (Plummer et al., 2015). We report the performance on the VQAv2 benchmark (Agrawal et al., 2015) in Table 2. The results are presented in Table 4a, comparing to the state of the art on the MSRVTT-QA (Jun Xu & Rui, 2016) and MSVD-QA (Xu et al., 2017) datasets. Video captioning results on the MSRVTT (Jun Xu & Rui, 2016) and MSVD (Chen & Dolan, 2011; Xu et al., 2017) datasets are presented in Table 4b. We evaluate our work on the challenging LVIS dataset (Gupta et al., 2019).
Dataset Splits | Yes | MS COCO (5K test set) and Flickr30K (1K test set), with image-to-text and text-to-image columns per model [...] We report the performance on the VQAv2 benchmark (Agrawal et al., 2015) in Table 2, with Test-Dev and Test-Std columns [...] The results are presented in Table 4a, comparing to the state of the art on the MSRVTT-QA (Jun Xu & Rui, 2016) and MSVD-QA (Xu et al., 2017) datasets. Video captioning results on the MSRVTT (Jun Xu & Rui, 2016) and MSVD (Chen & Dolan, 2011; Xu et al., 2017) datasets are presented in Table 4b. The model is trained with base categories and tested to detect novel-category objects at inference.
Hardware Specification | No | The paper describes the model architecture and training configurations but does not provide specific details about the hardware (e.g., GPU models, CPU types, or TPU versions) used for running the experiments.
Software Dependencies | No | The paper mentions using the AdamW optimizer and the SentencePiece tokenizer but does not specify their version numbers or other software dependencies with specific versions.
Experiment Setup | Yes | The large model is trained for 500K steps using a batch size of 16K. We use the AdamW optimizer with weight decay value 0.01. Our initial learning rate is 0.001, and both generative and contrastive loss weights are set to 1.0. We first resize every image to 272x272 and randomly crop a 224x224 patch out for pretraining. We apply 10K warmup steps before applying linear LR decay to the end of training. The temperature in contrastive learning is learnable and initialized to 1.0.
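The reported schedule (10K linear warmup steps, then linear LR decay over a 500K-step run, peak learning rate 0.001) can be sketched as a small helper. This is a minimal illustration, not the authors' code: the function name is invented, and decaying linearly to exactly zero at the final step is an assumption, since the paper only says "linear LR decay to the end of training".

```python
def mammut_lr(step: int,
              peak_lr: float = 1e-3,      # initial LR from the paper
              warmup_steps: int = 10_000,  # 10K warmup steps
              total_steps: int = 500_000   # 500K total training steps
              ) -> float:
    """Linear warmup followed by linear decay.

    Decaying to exactly 0.0 at `total_steps` is an assumption made
    for illustration; the paper does not state the final LR value.
    """
    if step < warmup_steps:
        # Ramp linearly from 0 up to peak_lr over the warmup phase.
        return peak_lr * step / warmup_steps
    # Decay linearly from peak_lr back toward 0 by the end of training.
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)
```

For example, `mammut_lr(5_000)` returns 0.0005 (halfway through warmup) and `mammut_lr(10_000)` returns the peak value 0.001.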