Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts Reasoning

Authors: Minheng Ni, Yutao Fan, Lei Zhang, Wangmeng Zuo

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility (Variable | Result | LLM Response)
Research Type: Experimental. Experiments show that our method not only enhances the performance of models of different intelligence levels on ambiguous instructions but also improves their performance on general datasets. We conduct a series of experiments, applying VISUAL-O1 to state-of-the-art high-intelligence and general-intelligence models on two typical multi-modal tasks: referring image segmentation (RIS) and visual question answering (VQA). All comparisons are divided into ambiguous and general instructions to comprehensively evaluate the models' performance on ambiguous and non-ambiguous instructions. Baselines and evaluation metrics: a series of typical methods serve as baselines; for RIS we compare gIoU and cIoU, and for VQA we compare accuracy and BLEU-1. Ablation studies demonstrate that VISUAL-O1 can be easily applied to different multi-modal models and tasks.
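For context on the RIS metrics named above: under the conventions common in the RIS literature, gIoU is the mean of per-image IoUs and cIoU is the cumulative IoU (total intersection over total union across the dataset). A minimal sketch of both, assuming binary NumPy masks as inputs:

```python
import numpy as np

def giou_ciou(pred_masks, gt_masks):
    """Compute gIoU (mean per-image IoU) and cIoU (cumulative IoU) over binary masks."""
    inters, unions, ious = [], [], []
    for p, g in zip(pred_masks, gt_masks):
        p, g = p.astype(bool), g.astype(bool)
        inter = np.logical_and(p, g).sum()   # pixels in both prediction and ground truth
        union = np.logical_or(p, g).sum()    # pixels in either
        inters.append(inter)
        unions.append(union)
        ious.append(inter / union if union > 0 else 1.0)
    giou = float(np.mean(ious))                              # average of per-image IoUs
    ciou = float(sum(inters) / sum(unions)) if sum(unions) > 0 else 1.0  # dataset-level IoU
    return giou, ciou
```

Note that cIoU weights large objects more heavily than gIoU, which is why papers typically report both.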
Researcher Affiliation: Academia. 1 Department of Computing, Hong Kong Polytechnic University; 2 Faculty of Computing, Harbin Institute of Technology; 3 Pengcheng Laboratory.
Pseudocode: No. The paper describes the proposed framework using mathematical equations (Eqs. 1-9) and conceptual diagrams (Figure 2); it does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code: Yes. We release our data and code at https://github.com/kodenii/Visual-O1.
Open Datasets: Yes. We also construct a dataset containing various types of ambiguous instructions, including ellipsis, colloquialism, subjectivity, relativity, and others, to validate performance across different multi-modal scenarios; this dataset will be released under the MIT license. For RIS, we use the REFCOCO+ dataset (Kazemzadeh et al., 2014); for VQA, we use the VIZWIZ dataset (Gurari et al., 2018).
Dataset Splits: No. The paper mentions creating ambiguous-instruction "subsets of 150, 650, and 106, respectively" for RIS and VQA, and for VLN it uses the "valid unseen split of the ROOM-TO-ROOM dataset (Anderson et al., 2018)". However, it provides no specific percentages or counts for the training, validation, or test splits of the main RIS and VQA experiments, nor does it explicitly define how the constructed ambiguous-instruction subsets map onto those splits.
Hardware Specification: No. The paper reports VRAM usage (e.g., "88797MB" and "16148MB" in Table K) when discussing computational overhead, but it does not specify the GPU or CPU models used in the experiments, which is required for a reproducible hardware specification.
Software Dependencies: No. The paper mentions models such as GPT-4O, LLAVA, LISA, and SOM, but it does not specify version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used in the experiments.
Experiment Setup: Yes. We run experiments on 10 random seeds to obtain average results. We set the budgets N_ins and N_emp to 10 and 3 for the general-intelligence and high-intelligence models, respectively.
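The seed-averaging protocol quoted above can be sketched as follows; `run_trial` is a hypothetical stand-in for one Visual-O1 evaluation run, not the authors' code:

```python
import random
import statistics

def run_trial(seed):
    # Hypothetical stand-in for a single evaluation run of the method;
    # here it simply produces a deterministic pseudo-score from the seed.
    rng = random.Random(seed)
    return 0.5 + 0.1 * rng.random()

def average_over_seeds(n_seeds=10):
    """Average a metric over n_seeds random seeds, mirroring the paper's
    '10 random seeds' reporting protocol. Returns (mean, sample stdev)."""
    scores = [run_trial(seed) for seed in range(n_seeds)]
    return statistics.mean(scores), statistics.stdev(scores)
```

Reporting mean and standard deviation over seeds makes the per-row numbers in the paper's comparison tables interpretable as averages rather than single runs.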