CogCoM: A Visual Language Model with Chain-of-Manipulations Reasoning

Authors: Ji Qi, Ming Ding, Weihan Wang, Yushi Bai, Qingsong Lv, Wenyi Hong, Xu Bin, Lei Hou, Juanzi Li, Yuxiao Dong, Jie Tang

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We conduct extensive experiments on 9 benchmarks from 4 categories: TextVQA (Singh et al., 2019), ST-VQA (Biten et al., 2019), TallyQA (Acharya et al., 2019), and GQA (Hudson & Manning, 2019) for detailed visual question answering; RefCOCO (Yu et al., 2016), RefCOCO+ (Yu et al., 2016), and RefCOCOg (Mao et al., 2016) for visual grounding; POPE (Li et al., 2023d) for hallucination validation; and MM-Vet (Yu et al., 2023b) for general multimodal ability. Our model achieves up to 9.0 and 1.09 points of accuracy improvement on the detailed VQA and grounding benchmarks, respectively, and superior performance on the general multimodal benchmark. The results demonstrate the effectiveness of the mechanism while maintaining interpretability.
Researcher Affiliation Collaboration Tsinghua University; Zhipu AI. Work done when JQ, WW, YB, and WH interned at Zhipu AI. Corresponding authors: BX and JT (xubin | EMAIL)
Pseudocode Yes We provide the pseudocode of the CoM synthesis algorithm (Algorithm 1: Synthesising Chain of Manipulations) to clearly explain the process of data generation, thereby facilitating understanding and reproduction.
Open Source Code Yes Code, model, and data are available at https://github.com/THUDM/CogCoM.
Open Datasets Yes Following the pre-training of CogVLM (Wang et al., 2023a), we first train the model on 1.5B image-text pairs cleaned from LAION-2B (Schuhmann et al., 2022) and COYO-700M (Byeon et al., 2022) for 120,000 iterations with a batch size of 8,192. We then train the model on 40M grounded image-question-answer triples cleaned from LAION-115M (Li et al., 2023b) for 60,000 iterations with a batch size of 1,024, where each noun phrase in the answer is followed by a list of coordinates [[x0, y0, x1, y1], ...] linking the phrase to the grounded objects in the image. ... We implement this pipeline on 3 existing datasets that require detailed recognition or counting (TextVQA (Singh et al., 2019), ST-VQA (Biten et al., 2019), and TDIUC (Shrestha et al., 2019)) to build 70K CoM samples. ... We perform this annotation on MathVista (Lu et al., 2023) and ChartQA (Masry et al., 2022), which include geometric and chart math problems, resulting in the collection of 7K high-quality CoM math samples. ... We fuse the produced CoM data with 3 types of corpora, namely MultiInstruct (Xu et al., 2022), LLaVAR (Zhang et al., 2023b), and ShareGPT4V (Chen et al., 2023b), covering the abilities of instruction following, text recognition, and detailed captioning.
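The grounded-answer format quoted above, where each noun phrase is followed by a coordinate list [[x0, y0, x1, y1], ...], can be parsed with a small helper. This is an illustrative sketch: the function name and the exact answer-string layout are assumptions, as the paper's row only specifies the coordinate-list convention, not the surrounding text format.

```python
import re

def parse_grounded_answer(answer):
    """Extract (phrase, boxes) pairs from a grounded answer string.

    Assumes each noun phrase is immediately followed by its box list,
    e.g. "a dog [[0.10,0.20,0.45,0.80]]". The exact serialization used
    by CogCoM may differ; treat this as an illustrative parser only.
    """
    pairs = []
    # Non-greedy phrase text up to the next "[[ ... ]]" box list.
    for phrase, coords in re.findall(r"(.+?)\s*\[\[(.+?)\]\]", answer):
        boxes = []
        # Boxes inside the list are separated by "],[" (spaces optional).
        for box in re.split(r"\]\s*,\s*\[", coords):
            x0, y0, x1, y1 = (float(v) for v in box.split(","))
            boxes.append((x0, y0, x1, y1))
        pairs.append((phrase.strip(), boxes))
    return pairs
```

A phrase with several grounded instances simply carries several boxes in its list, which the splitter above preserves.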
Dataset Splits Yes We use the official evaluation scripts for GQA and TallyQA, which calculate the accuracy score by Exact Matching (EM) between model predictions and answers. For TextVQA and ST-VQA, we submit our model predictions to the official online evaluation servers, which calculate accuracy with the VQA Score metric (Antol et al., 2015). ... Due to the lack of resources, we build CoM-test, a benchmark with CoM reasoning chains built on the TextVQA test set via the proposed data generation pipeline, and also introduce a keypoints-aware metric to validate the correctness of reasoning paths (see Appendix E.3 for detailed statistics).
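The two scoring rules named in this row can be sketched in a few lines. The EM helper below is a simplification (the official GQA/TallyQA scripts apply their own answer normalization), and the VQA Score function uses the common closed form of the Antol et al. (2015) soft accuracy, min(#matching human answers / 3, 1), omitting the official averaging over 9-answer subsets.

```python
def exact_match_accuracy(predictions, answers):
    """EM accuracy: fraction of predictions identical to the reference
    after lowercasing and whitespace stripping (a simplified sketch)."""
    norm = lambda s: s.strip().lower()
    hits = sum(norm(p) == norm(a) for p, a in zip(predictions, answers))
    return hits / len(answers)

def vqa_score(prediction, human_answers):
    """VQA Score (Antol et al., 2015), common closed form: a prediction
    matching k of the (typically 10) human answers scores min(k/3, 1)."""
    matches = sum(
        prediction.strip().lower() == a.strip().lower() for a in human_answers
    )
    return min(matches / 3.0, 1.0)
```

So a prediction agreeing with at least 3 annotators receives full credit, while one agreeing with a single annotator receives 1/3.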
Hardware Specification Yes Table 6: Training details of all stages. Hardware environment: 3,840 A100-days (Stage 1-1), 256 A100-days (Stage 1-2), 160 A100-days (Stage 2)
Software Dependencies Yes Concretely, the pre-trained EVA2-CLIP-E (Sun et al., 2023a) with 4B parameters and Vicuna-7B-v1.5 (Chiang et al., 2023) are adopted as the visual encoder and the LLM backbone, respectively.
Experiment Setup Yes Table 6: Training details of all stages. Objective: next-token prediction (all stages). Images: 1.5B (Stage 1-1), 40M (Stage 1-2), 576K (Stage 2). Batch size: 8,192 (Stage 1-1), 1,024 (Stage 1-2), 160 (Stage 2). Iterations: 120,000 (Stage 1-1), 60,000 (Stage 1-2), 14,000 (Stage 2). Optimizer: AdamW (all stages). Learning rate: 1e-4 (Stage 1-1), 1e-5 (Stage 1-2), 1e-5 (Stage 2). Warmup steps: 7,200 (Stage 1-1), 1,200 (Stage 1-2), 280 (Stage 2). Trainable weights: 6.5B visual expert (all stages).
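The per-stage warmup figures in Table 6 imply a step-dependent learning-rate schedule. The sketch below assumes linear warmup to the stage's peak rate and a constant rate afterwards; the table does not state the post-warmup decay, so the constant tail is an assumption, not the paper's confirmed schedule.

```python
def lr_at_step(step, peak_lr, warmup_steps):
    """Learning rate at a given 0-indexed optimizer step.

    Linear warmup from (peak_lr / warmup_steps) up to peak_lr over the
    first `warmup_steps` steps, then constant. The constant tail is an
    assumption; Table 6 only specifies peak LR and warmup length.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr

# Stage 1-1 per Table 6: peak 1e-4, 7,200 warmup steps, 120,000 iterations.
stage1_1_lr = [lr_at_step(s, 1e-4, 7200) for s in (0, 3600, 7200, 119999)]
```

The same function covers Stage 1-2 and Stage 2 by substituting their peak rates (1e-5) and warmup lengths (1,200 and 280).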