reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents

Authors: Xiao Liu, Tianjie Zhang, Yu Gu, Iat Long Iong, Song XiXuan, Yifan Xu, Shudan Zhang, Hanyu Lai, Jiadai Sun, Xinyue Yang, Yu Yang, Zehan Qi, Shuntian Yao, Xueqiao Sun, Siyi Cheng, Qinkai Zheng, Hao Yu, Hanchen Zhang, Wenyi Hong, Ming Ding, Lihang Pan, Xiaotao Gu, Aohan Zeng, Zhengxiao Du, Chan Hee Song, Yu Su, Yuxiao Dong, Jie Tang

ICLR 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Through rigorous testing across 9 proprietary LMM APIs and 9 open models (18 in total), we demonstrate the considerable yet still developing visual agent capabilities of these models.
Researcher Affiliation	Academia	Tsinghua University, The Ohio State University, Zhejiang University, Peking University
Pseudocode	Yes	You are an intelligent agent exceling at solving household tasks. You are in a household environment given a task to finish. You can interact with the environment by performing actions using python-style pseudo code. For each turn, please call exactly one predefined action. (Appendix B.4)
Open Source Code	Yes	Code, train, and test data are available at https://github.com/THUDM/VisualAgentBench.
Open Datasets	Yes	Code, train, and test data are available at https://github.com/THUDM/VisualAgentBench. VAB strives to offer the first multitask multi-environment trajectory train set for developing LMM agents, containing 4,482 high-quality training trajectories spanning 5 environments.
Dataset Splits	Yes	Table 2: Statistics of all datasets in VAB. #Test Instance #Train Trajectory ... VAB-OmniGibson 181 872 ... VAB-Minecraft 116 382 ... VAB-AndroidLab 119 1,213 ... VAB-WebArena-Lite 165 1,186 ... VAB-CSS 165 829
Hardware Specification	No	OmniGibson has no friendly interface for humans to operate on, and requires high-end laptops with GPUs supporting ray tracing and large main memory (> 10 GB) to run. ... With larger backbone LLMs (insufficiently tested here due to computing resource limitations) ...
Software Dependencies	No	On the one hand, there have been a mature web automation tool Playwright that supports Python. ... Other hyperparameters are configured using the default ones provided by the model's original repository or the third-party's integrated training framework.
Experiment Setup	Yes	All models undergo full-parameter fine-tuning for 5k steps with batch size 64, with CSS data duplicated to improve adaptation to the screenshot format. See details in Appendix A.5.
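
The figures quoted in the Dataset Splits and Experiment Setup rows can be cross-checked with a little arithmetic. The sketch below is illustrative only and is not part of the paper or its repository: the dictionary layout and variable names are hypothetical, "5k steps" is assumed to mean exactly 5,000 optimizer steps, and each batch element is assumed to be one training sample (the excerpt does not say whether a sample is a whole trajectory or a single step).

```python
# Per-environment counts as quoted in the Dataset Splits row (Table 2).
# The dictionary structure is a hypothetical convenience, not the paper's format.
splits = {
    "VAB-OmniGibson": {"test": 181, "train": 872},
    "VAB-Minecraft": {"test": 116, "train": 382},
    "VAB-AndroidLab": {"test": 119, "train": 1213},
    "VAB-WebArena-Lite": {"test": 165, "train": 1186},
    "VAB-CSS": {"test": 165, "train": 829},
}


def total(kind: str) -> int:
    """Sum one column ('test' or 'train') over all five environments."""
    return sum(env[kind] for env in splits.values())


total_train = total("train")  # training trajectories across environments
total_test = total("test")    # test instances across environments

# Experiment Setup row: 5k steps with batch size 64.
# Assumption: 5k = 5,000 and one batch element = one training sample.
steps, batch_size = 5000, 64
samples_seen = steps * batch_size

print(f"train trajectories: {total_train}")
print(f"test instances:     {total_test}")
print(f"samples processed:  {samples_seen}")
```

The training column sums to 4,482, which agrees with the "4,482 high-quality training trajectories" quoted in the Open Datasets row. The processed-sample count is only an upper-level estimate, since the quoted setup also duplicates CSS data by an unspecified factor.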