VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents

Authors: Xiao Liu, Tianjie Zhang, Yu Gu, Iat Long Iong, Song XiXuan, Yifan Xu, Shudan Zhang, Hanyu Lai, Jiadai Sun, Xinyue Yang, Yu Yang, Zehan Qi, Shuntian Yao, Xueqiao Sun, Siyi Cheng, Qinkai Zheng, Hao Yu, Hanchen Zhang, Wenyi Hong, Ming Ding, Lihang Pan, Xiaotao Gu, Aohan Zeng, Zhengxiao Du, Chan Hee Song, Yu Su, Yuxiao Dong, Jie Tang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through rigorous testing across 9 proprietary LMM APIs and 9 open models (18 in total), we demonstrate the considerable yet still developing visual agent capabilities of these models.
Researcher Affiliation | Academia | 1 Tsinghua University, 2 The Ohio State University, 3 Zhejiang University, 4 Peking University
Pseudocode | Yes | You are an intelligent agent exceling at solving household tasks. You are in a household environment given a task to finish. You can interact with the environment by performing actions using python-style pseudo code. For each turn, please call exactly one predefined action. (Appendix B.4)
Open Source Code | Yes | Code, train, and test data are available at https://github.com/THUDM/VisualAgentBench.
Open Datasets | Yes | Code, train, and test data are available at https://github.com/THUDM/VisualAgentBench. VAB strives to offer the first multitask multi-environment trajectory train set for developing LMM agents, containing 4,482 high-quality training trajectories spanning 5 environments.
Dataset Splits | Yes | Table 2: Statistics of all datasets in VAB. #Test Instance #Train Trajectory ... VAB-OmniGibson 181 872 ... VAB-Minecraft 116 382 ... VAB-AndroidLab 119 1,213 ... VAB-WebArena-Lite 165 1,186 ... VAB-CSS 165 829
Hardware Specification | No | OmniGibson has no friendly interface for humans to operate on, and requires high-end laptops with GPUs supporting ray tracing and large main memory (> 10 GB) to run. ... With larger backbone LLMs (insufficiently tested here due to computing resource limitations) ...
Software Dependencies | No | On the one hand, there have been a mature web automation tool Playwright that supports Python. ... Other hyperparameters are configured using the default ones provided by the model's original repository or the third-party's integrated training framework.
Experiment Setup | Yes | All models undergo full-parameter fine-tuning for 5k steps with batch size 64, with CSS data duplicated to improve adaptation to the screenshot format. See details in Appendix A.5.
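
As a quick consistency check, the per-environment counts quoted in the Dataset Splits row can be summed and compared with the 4,482 training trajectories and 5 environments quoted in the Open Datasets row. The following is a minimal Python sketch of that arithmetic only; the names VAB_SPLITS and totals are illustrative assumptions and do not come from the VAB repository.

```python
# Counts copied from the Dataset Splits row (Table 2 of the paper).
# VAB_SPLITS and totals() are hypothetical names used only for this check.
VAB_SPLITS = {
    "VAB-OmniGibson": {"test_instances": 181, "train_trajectories": 872},
    "VAB-Minecraft": {"test_instances": 116, "train_trajectories": 382},
    "VAB-AndroidLab": {"test_instances": 119, "train_trajectories": 1213},
    "VAB-WebArena-Lite": {"test_instances": 165, "train_trajectories": 1186},
    "VAB-CSS": {"test_instances": 165, "train_trajectories": 829},
}


def totals(splits):
    """Return (total test instances, total training trajectories)."""
    test = sum(row["test_instances"] for row in splits.values())
    train = sum(row["train_trajectories"] for row in splits.values())
    return test, train


if __name__ == "__main__":
    test_total, train_total = totals(VAB_SPLITS)
    print(f"environments: {len(VAB_SPLITS)}")
    print(f"test instances: {test_total}")
    print(f"train trajectories: {train_total}")
```

The training counts sum to 4,482 across 5 environments, which agrees with the figure quoted in the Open Datasets row; the test instances sum to 746.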