Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts Reasoning

Authors: Minheng Ni, Yutao Fan, Lei Zhang, Wangmeng Zuo

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility (Variable | Result | LLM Response)
Research Type: Experimental. Experiments show that our method not only enhances the performance of models of different intelligence levels on ambiguous instructions but also improves their performance on general datasets. We conduct a series of experiments, applying VISUAL-O1 to state-of-the-art high-intelligence and general-intelligence models on two typical multi-modal tasks: referring image segmentation (RIS) and visual question answering (VQA). All comparisons are divided into ambiguous and general instructions to comprehensively evaluate the models' performance on ambiguous and non-ambiguous instructions. Baselines and evaluation metrics: a series of typical methods serve as baselines; for RIS we compare gIoU and cIoU, and for VQA we compare accuracy and BLEU-1. Ablation studies demonstrate that VISUAL-O1 can be easily applied to different multi-modal models and tasks.
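For context on the RIS metrics named above: under the conventions common in the RIS literature, gIoU is the mean of per-image IoUs and cIoU is the cumulative IoU (total intersection over total union across the dataset). A minimal sketch of both, assuming binary NumPy masks as inputs:

```python
import numpy as np

def giou_ciou(pred_masks, gt_masks):
    """Compute gIoU (mean per-image IoU) and cIoU (cumulative IoU) over binary masks."""
    inters, unions, ious = [], [], []
    for p, g in zip(pred_masks, gt_masks):
        p, g = p.astype(bool), g.astype(bool)
        inter = np.logical_and(p, g).sum()   # pixels in both prediction and ground truth
        union = np.logical_or(p, g).sum()    # pixels in either
        inters.append(inter)
        unions.append(union)
        ious.append(inter / union if union > 0 else 1.0)
    giou = float(np.mean(ious))                              # average of per-image IoUs
    ciou = float(sum(inters) / sum(unions)) if sum(unions) > 0 else 1.0  # dataset-level IoU
    return giou, ciou
```

Note that cIoU weights large objects more heavily than gIoU, which is why papers typically report both.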
Researcher Affiliation: Academia. 1 Department of Computing, Hong Kong Polytechnic University; 2 Faculty of Computing, Harbin Institute of Technology; 3 Pengcheng Laboratory.
Pseudocode: No. The paper describes the proposed framework using mathematical equations (Eqs. 1-9) and conceptual diagrams (Figure 2); it does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code: Yes. We release our data and code at https://github.com/kodenii/Visual-O1.
Open Datasets: Yes. We also construct a dataset containing various types of ambiguous instructions, including ellipsis, colloquialism, subjectivity, relativity, and others, to validate performance across different multi-modal scenarios; this dataset will be released under the MIT license. For RIS, we use the REFCOCO+ dataset (Kazemzadeh et al., 2014); for VQA, we use the VIZWIZ dataset (Gurari et al., 2018).
Dataset Splits: No. The paper mentions creating ambiguous-instruction "subsets of 150, 650, and 106, respectively" for RIS and VQA, and for VLN it uses the "valid unseen split of the ROOM-TO-ROOM dataset (Anderson et al., 2018)". However, it provides no specific percentages or counts for the training, validation, or test splits of the main RIS and VQA experiments, nor does it explicitly define how the constructed ambiguous-instruction subsets map onto those splits.
Hardware Specification: No. The paper reports VRAM usage (e.g., "88797MB" and "16148MB" in Table K) when discussing computational overhead, but it does not specify the GPU or CPU models used in the experiments, which is required for a reproducible hardware specification.
Software Dependencies: No. The paper mentions models such as GPT-4O, LLAVA, LISA, and SOM, but it does not specify version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used in the experiments.
Experiment Setup: Yes. We run experiments on 10 random seeds to obtain average results. We set the budgets N_ins and N_emp to 10 and 3 for the general-intelligence and high-intelligence models, respectively.
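The seed-averaging protocol quoted above can be sketched as follows; `run_trial` is a hypothetical stand-in for one Visual-O1 evaluation run, not the authors' code:

```python
import random
import statistics

def run_trial(seed):
    # Hypothetical stand-in for a single evaluation run of the method;
    # here it simply produces a deterministic pseudo-score from the seed.
    rng = random.Random(seed)
    return 0.5 + 0.1 * rng.random()

def average_over_seeds(n_seeds=10):
    """Average a metric over n_seeds random seeds, mirroring the paper's
    '10 random seeds' reporting protocol. Returns (mean, sample stdev)."""
    scores = [run_trial(seed) for seed in range(n_seeds)]
    return statistics.mean(scores), statistics.stdev(scores)
```

Reporting mean and standard deviation over seeds makes the per-row numbers in the paper's comparison tables interpretable as averages rather than single runs.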