Imagine While Reasoning in Space: Multimodal Visualization-of-Thought

Authors: Chengzu Li, Wenshan Wu, Huanyu Zhang, Yan Xia, Shaoguang Mao, Li Dong, Ivan Vulić, Furu Wei

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct comprehensive experiments and ablation studies across three spatial reasoning tasks with newly collected datasets, demonstrating that MVoT exhibits superior adaptability and robustness compared to CoT in complex scenarios.
Researcher Affiliation | Collaboration | ¹Language Technology Lab, University of Cambridge; ²Microsoft Research; ³Institute of Automation, Chinese Academy of Sciences.
Pseudocode | No | The paper describes its methods using mathematical formulations (Equations 1-5) and textual descriptions, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | We will release the code and the datasets at URL-ANONYMOUS upon acceptance for reproducibility purposes.
Open Datasets | No | We will release the code and the datasets at URL-ANONYMOUS upon acceptance for reproducibility purposes.
Dataset Splits | Yes | The dataset statistics are presented in Table 4. Detailed information on data collection is provided in App. B. ... Table 4. Statistics of the collected datasets, covering varying levels of complexity in actions and patterns. ... Train set sizes: 5007 / 6400 / 6846; test set sizes: 1255 / 1604 / 1664 (one value per task across the three spatial reasoning tasks).
Hardware Specification | Yes | All models were trained on MI300X GPUs.
Software Dependencies | Yes | For GPT-4o, we utilized the 2024-07-01 version hosted on the Azure platform, with inference parameters outlined in Table 9. (See the hedged inference sketch after this table.)
Experiment Setup | Yes | Tables 8 and 9 show the hyper-parameters for training MVoT and for inference with GPT-4o. ... Table 8. Hyper-parameters for fine-tuning Anole-7B for different system variants: Random Seed 42; Epochs 40; Learning Rate 0.0002; Train Batch Size 4; Val Batch Size 16 / 8; Grad Accumulation 4 / 2; GPUs 8 / 32 (where two values are listed, each corresponds to one system variant). (See the training-config sketch after this table.)
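
For the Software Dependencies row, a minimal sketch of how inference against an Azure-hosted GPT-4o deployment is typically wired up with the official openai Python SDK. The endpoint, deployment name, API-version string, and sampling parameters below are illustrative assumptions; the paper's actual inference parameters are given in its Table 9, which is not reproduced here.

```python
# Hypothetical sketch: querying an Azure-hosted GPT-4o deployment.
# The paper reports using the 2024-07-01 GPT-4o version on Azure with
# inference parameters from its Table 9; every concrete value below
# (API version, deployment name, temperature, max_tokens) is an
# assumption for illustration, not a value taken from the paper.
import os

from openai import AzureOpenAI  # pip install openai

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-07-01-preview",  # assumed API-version string
)

response = client.chat.completions.create(
    model="gpt-4o",  # Azure deployment name (assumption)
    messages=[{"role": "user", "content": "Describe the next move in the maze."}],
    temperature=0.0,  # placeholder; actual values are in the paper's Table 9
    max_tokens=1024,  # placeholder
)
print(response.choices[0].message.content)
```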
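
For the Experiment Setup row, a minimal sketch that collects the Table 8 fine-tuning hyper-parameters into a single config structure, assuming the paired values (16/8, 4/2, 8/32) correspond to the two system variants; the field names and variant labels are hypothetical, not the paper's.

```python
# Sketch of the Table 8 hyper-parameters for fine-tuning Anole-7B.
# The dataclass fields and the "variant_a"/"variant_b" labels are
# assumptions; the paper only reports the numbers, in two columns
# for different system variants.
from dataclasses import dataclass


@dataclass
class FinetuneConfig:
    random_seed: int = 42
    epochs: int = 40
    learning_rate: float = 2e-4
    train_batch_size: int = 4
    val_batch_size: int = 16
    grad_accumulation: int = 4
    num_gpus: int = 8


# Values from the two columns of Table 8.
variant_a = FinetuneConfig()  # val batch 16, grad accumulation 4, 8 GPUs
variant_b = FinetuneConfig(val_batch_size=8, grad_accumulation=2, num_gpus=32)

# Effective global batch = per-device batch * grad-accumulation steps * GPUs.
for name, cfg in [("variant_a", variant_a), ("variant_b", variant_b)]:
    global_bs = cfg.train_batch_size * cfg.grad_accumulation * cfg.num_gpus
    print(f"{name}: effective train batch size = {global_bs}")
```

Under this reading, the two variants train with effective global batch sizes of 128 and 256 respectively.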