Multimodal Chain-of-Thought Reasoning in Language Models
Authors: Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, Alex Smola
TMLR 2024 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on ScienceQA and A-OKVQA benchmark datasets show the effectiveness of our proposed approach. With Multimodal-CoT, our model under 1 billion parameters achieves state-of-the-art performance on the ScienceQA benchmark. Our analysis indicates that Multimodal-CoT offers the advantages of mitigating hallucination and enhancing convergence speed. |
| Researcher Affiliation | Collaboration | Zhuosheng Zhang EMAIL School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University; Aston Zhang EMAIL GenAI, Meta; Mu Li EMAIL Amazon Web Services; Hai Zhao EMAIL Department of Computer Science and Engineering, Shanghai Jiao Tong University; George Karypis EMAIL Amazon Web Services; Alex Smola EMAIL Amazon Web Services |
| Pseudocode | No | The paper describes methods and procedures using prose and mathematical equations but does not include any explicitly labeled pseudocode or algorithm blocks. For example, Section 4.2 'Model Architecture' describes the encoding, interaction, and decoding steps in text. |
| Open Source Code | Yes | Code is publicly available at https://github.com/amazon-science/mm-cot. |
| Open Datasets | Yes | Our method is evaluated on the ScienceQA (Lu et al., 2022a) and A-OKVQA (Schwenk et al., 2022) benchmark datasets. We choose those datasets because they are the latest multimodal reasoning benchmarks with annotated reasoning chains. |
| Dataset Splits | Yes | ScienceQA is a large-scale multimodal science question dataset with annotated lectures and explanations. It contains 21k multimodal multiple-choice questions with rich domain diversity across 3 subjects, 26 topics, 127 categories, and 379 skills. There are 12k, 4k, and 4k questions in the training, validation, and test splits, respectively. A-OKVQA is a knowledge-based visual question answering benchmark, which has 25k questions requiring a broad base of commonsense and world knowledge to answer. It has 17k/1k/6k questions for train/val/test. |
| Hardware Specification | Yes | Our experiments are run on 8 NVIDIA Tesla V100 32G GPUs. |
| Software Dependencies | No | The paper mentions using the T5 encoder-decoder architecture, FLAN-Alpaca, a ViT-large encoder, and InstructBLIP, but does not provide specific version numbers for underlying software dependencies like Python, PyTorch, or CUDA, which are typically required for replication. |
| Experiment Setup | Yes | We fine-tune the models up to 20 epochs, with a learning rate selected in {5e-5, 8e-5}. The maximum input sequence lengths for rationale generation and answer inference are 512 and 64, respectively. The batch size is 8. |
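The experiment-setup row above can be summarized as a small configuration sketch. This is illustrative only: the names (`BASE_CONFIG`, `candidate_configs`) are hypothetical and do not come from the mm-cot repository; only the values (20 epochs, learning rate selected from {5e-5, 8e-5}, sequence lengths 512/64, batch size 8) are taken from the paper's reported setup.

```python
# Hypothetical sketch of the reported fine-tuning hyperparameters.
# Values are from the paper; the structure and names are assumptions.
BASE_CONFIG = {
    "epochs": 20,                      # fine-tune up to 20 epochs
    "batch_size": 8,
    "max_input_len_rationale": 512,    # rationale generation stage
    "max_input_len_answer": 64,        # answer inference stage
}

LEARNING_RATES = [5e-5, 8e-5]          # lr is selected from this set


def candidate_configs():
    """Expand the base config with each candidate learning rate."""
    return [{**BASE_CONFIG, "learning_rate": lr} for lr in LEARNING_RATES]
```

A replication would run each candidate config and keep the learning rate that performs best on the validation split.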