VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents
Authors: Xiao Liu, Tianjie Zhang, Yu Gu, Iat Long Iong, Song XiXuan, Yifan Xu, Shudan Zhang, Hanyu Lai, Jiadai Sun, Xinyue Yang, Yu Yang, Zehan Qi, Shuntian Yao, Xueqiao Sun, Siyi Cheng, Qinkai Zheng, Hao Yu, Hanchen Zhang, Wenyi Hong, Ming Ding, Lihang Pan, Xiaotao Gu, Aohan Zeng, Zhengxiao Du, Chan Hee Song, Yu Su, Yuxiao Dong, Jie Tang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through rigorous testing across 9 proprietary LMM APIs and 9 open models (18 in total), we demonstrate the considerable yet still developing visual agent capabilities of these models. |
| Researcher Affiliation | Academia | 1Tsinghua University, 2The Ohio State University, 3Zhejiang University, 4Peking University |
| Pseudocode | Yes | You are an intelligent agent exceling at solving household tasks. You are in a household environment given a task to finish. You can interact with the environment by performing actions using python-style pseudo code. For each turn, please call exactly one predefined action. (Appendix B.4) |
| Open Source Code | Yes | Code, train, and test data are available at https://github.com/THUDM/VisualAgentBench. |
| Open Datasets | Yes | Code, train, and test data are available at https://github.com/THUDM/VisualAgentBench. VAB strives to offer the first multitask multi-environment trajectory train set for developing LMM agents, containing 4,482 high-quality training trajectories spanning 5 environments. |
| Dataset Splits | Yes | Table 2: Statistics of all datasets in VAB. #Test Instance #Train Trajectory ... VAB-OmniGibson 181 872 ... VAB-Minecraft 116 382 ... VAB-AndroidLab 119 1,213 ... VAB-WebArena-Lite 165 1,186 ... VAB-CSS 165 829 |
| Hardware Specification | No | OmniGibson has no friendly interface for humans to operate on, and requires high-end laptops with GPUs supporting ray tracing and large main memory (> 10 GB) to run. ... With larger backbone LLMs (insufficiently tested here due to computing resource limitations) ... |
| Software Dependencies | No | On the one hand, there have been a mature web automation tool Playwright that supports Python. ... Other hyperparameters are configured using the default ones provided by the model's original repository or the third-party's integrated training framework. |
| Experiment Setup | Yes | All models undergo full-parameter fine-tuning for 5k steps with batch size 64, with CSS data duplicated to improve adaptation to the screenshot format. See details in Appendix A.5. |
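
The Dataset Splits row can be checked against the "4,482 high-quality training trajectories" figure quoted in the Open Datasets row. The following is a minimal sketch of that arithmetic, assuming the five environments in the Table 2 excerpt are the complete set. The dictionary layout and the `totals` helper are illustrative names, not part of the VAB codebase.

```python
# Counts copied from the Table 2 excerpt quoted in the Dataset Splits row.
# The structure and helper below are illustrative, not taken from the VAB repository.
VAB_SPLITS = {
    "VAB-OmniGibson": {"test": 181, "train": 872},
    "VAB-Minecraft": {"test": 116, "train": 382},
    "VAB-AndroidLab": {"test": 119, "train": 1213},
    "VAB-WebArena-Lite": {"test": 165, "train": 1186},
    "VAB-CSS": {"test": 165, "train": 829},
}


def totals(splits):
    """Return (test instances, training trajectories) summed over all environments."""
    test_total = sum(counts["test"] for counts in splits.values())
    train_total = sum(counts["train"] for counts in splits.values())
    return test_total, train_total


test_total, train_total = totals(VAB_SPLITS)
print(f"test instances: {test_total}")
print(f"training trajectories: {train_total}")

# The training total should match the 4,482 trajectories stated in the Open Datasets row.
assert train_total == 4482
```

Under that assumption, the per-environment training counts sum to the stated 4,482, so the two rows are consistent with each other.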