Multimodal Chain-of-Thought Reasoning in Language Models
Authors: Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, Alex Smola
TMLR 2024 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on ScienceQA and A-OKVQA benchmark datasets show the effectiveness of our proposed approach. With Multimodal-CoT, our model under 1 billion parameters achieves state-of-the-art performance on the ScienceQA benchmark. Our analysis indicates that Multimodal-CoT offers the advantages of mitigating hallucination and enhancing convergence speed. |
| Researcher Affiliation | Collaboration | Zhuosheng Zhang EMAIL School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University; Aston Zhang EMAIL GenAI, Meta; Mu Li EMAIL Amazon Web Services; Hai Zhao EMAIL Department of Computer Science and Engineering, Shanghai Jiao Tong University; George Karypis EMAIL Amazon Web Services; Alex Smola EMAIL Amazon Web Services |
| Pseudocode | No | The paper describes methods and procedures using prose and mathematical equations but does not include any explicitly labeled pseudocode or algorithm blocks. For example, Section 4.2 'Model Architecture' describes the encoding, interaction, and decoding steps in text. |
| Open Source Code | Yes | Code is publicly available at https://github.com/amazon-science/mm-cot. |
| Open Datasets | Yes | Our method is evaluated on the ScienceQA (Lu et al., 2022a) and A-OKVQA (Schwenk et al., 2022) benchmark datasets. We choose those datasets because they are the latest multimodal reasoning benchmarks with annotated reasoning chains. |
| Dataset Splits | Yes | ScienceQA is a large-scale multimodal science question dataset with annotated lectures and explanations. It contains 21k multimodal multiple-choice questions with rich domain diversity across 3 subjects, 26 topics, 127 categories, and 379 skills. There are 12k, 4k, and 4k questions in the training, validation, and test splits, respectively. A-OKVQA is a knowledge-based visual question answering benchmark, which has 25k questions requiring a broad base of commonsense and world knowledge to answer. It has 17k/1k/6k questions for train/val/test. |
| Hardware Specification | Yes | Our experiments are run on 8 NVIDIA Tesla V100 32G GPUs. |
| Software Dependencies | No | The paper mentions using the T5 encoder-decoder architecture, FLAN-Alpaca, a ViT-large encoder, and InstructBLIP, but does not provide specific version numbers for underlying software dependencies like Python, PyTorch, or CUDA, which are typically required for replication. |
| Experiment Setup | Yes | We fine-tune the models up to 20 epochs, with a learning rate selected in {5e-5, 8e-5}. The maximum input sequence lengths for rationale generation and answer inference are 512 and 64, respectively. The batch size is 8. |
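The experiment-setup row above can be summarized as a small configuration sketch. This is illustrative only: the names (`BASE_CONFIG`, `candidate_configs`) are hypothetical and do not come from the mm-cot repository; only the values (20 epochs, learning rate selected from {5e-5, 8e-5}, sequence lengths 512/64, batch size 8) are taken from the paper's reported setup.

```python
# Hypothetical sketch of the reported fine-tuning hyperparameters.
# Values are from the paper; the structure and names are assumptions.
BASE_CONFIG = {
    "epochs": 20,                      # fine-tune up to 20 epochs
    "batch_size": 8,
    "max_input_len_rationale": 512,    # rationale generation stage
    "max_input_len_answer": 64,        # answer inference stage
}

LEARNING_RATES = [5e-5, 8e-5]          # lr is selected from this set


def candidate_configs():
    """Expand the base config with each candidate learning rate."""
    return [{**BASE_CONFIG, "learning_rate": lr} for lr in LEARNING_RATES]
```

A replication would run each candidate config and keep the learning rate that performs best on the validation split.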