Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage
Authors: Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaojian Ma, Tao Yuan, Yue Fan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, Qing Li
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluations on the GTA and GAIA benchmarks show that the T3-Agent consistently achieves improvements on two popular VLMs, MiniCPM-V-8.5B and Qwen2-VL-7B, outperforming the untrained VLMs by 20% — evidence that the proposed data synthesis pipeline yields high-quality data for tool-usage capabilities. |
| Researcher Affiliation | Academia | 1School of Intelligence Science and Technology, Peking University 2State Key Laboratory of General Artificial Intelligence, BIGAI 3Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology 4Guangdong Laboratory of Machine Perception and Intelligent Computing, Shenzhen MSU-BIT University 5Department of Automation, Tsinghua University |
| Pseudocode | No | The paper describes its methodology in natural language and illustrates agent interactions with 'Thought:', 'Code:', and 'Observation:' examples (Figures 1, 4, 5). However, it does not include a dedicated section or figure explicitly labeled 'Pseudocode' or 'Algorithm' that outlines the proposed method in a structured, code-like format. |
| Open Source Code | Yes | mat-agent.github.io |
| Open Datasets | Yes | With the data generation pipeline, we construct MM-Traj, a dataset that contains 20K multi-modal tasks with tool-usage trajectories. Based on the data synthesis pipeline, we collect the MM-Traj dataset that contains 20K tasks with trajectories of tool usage. ... we compile about 93K image-captioning pairs from 8 source datasets, including ChartQA (Masry et al., 2022), COCO (Lin et al., 2014), LLaVA (Wang et al., 2023a), SAM (Kirillov et al., 2023), TextVQA (Singh et al., 2019), Web-Celebrity (Liu et al., 2015), Web-Landmark (Weyand et al., 2020), and WikiArt (Saleh & Elgammal, 2015). |
| Dataset Splits | No | To preserve the visual perception and reasoning capabilities of MiniCPM-V and Qwen2-VL, we combine the training data in MM-Traj with the data in the Cauldron (Lindström & Abraham, 2022) and open-LLaVA-NeXT (Chen, 2024) datasets. We train 5 epochs over all data. ... The paper mentions using the validation set of the GAIA benchmark, but does not provide specific training/test/validation splits for its own constructed MM-Traj dataset or for the combined training data used for tuning. |
| Hardware Specification | Yes | Table 9 in section B.2 (DATA NUMBER) provides hardware information: 'Memory 214 GB'. |
| Software Dependencies | No | The paper mentions using Python packages such as 'matplotlib', 'opencv', 'pandas', and 'numpy', and models such as 'MiniCPM-V-8.5B' and 'Qwen2-VL-7B', but it does not specify version numbers for these software dependencies, which are required for reproducible descriptions. |
| Experiment Setup | Yes | To preserve the visual perception and reasoning capabilities of MiniCPM-V and Qwen2-VL, we combine the training data in MM-Traj with the data in the Cauldron (Lindström & Abraham, 2022) and open-LLaVA-NeXT (Chen, 2024) datasets. We train 5 epochs over all data. In the training process of our VLM controller, we freeze the vision encoder and visual token compressor, and fine-tune the language model using LoRA (Hu et al., 2022). We set the rank as 64 and apply LoRA on the query, key, and value projection matrices in all self-attention layers. We use the AdamW optimizer with a cosine annealing scheduler. The learning rate is 1e-6 and the batch size is 2. We set the max context window to 10240 to support the long trajectory of our agent. |
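The fine-tuning recipe quoted in the Experiment Setup row can be sketched with HuggingFace `peft` and `transformers`. This is a minimal sketch, not the authors' code: the `target_modules` names (`q_proj`, `k_proj`, `v_proj`), the `lora_alpha` value, and the output path are assumptions for a Qwen2-style backbone and are not stated in the paper.

```python
# Sketch of the reported tuning setup: LoRA (rank 64) on the q/k/v
# projection matrices, AdamW with a cosine annealing schedule,
# learning rate 1e-6, batch size 2, 5 epochs.
# NOTE: target_modules names and lora_alpha are assumptions, not
# values reported in the paper.
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=64,                                            # rank stated in the paper
    lora_alpha=128,                                  # assumed; not reported
    target_modules=["q_proj", "k_proj", "v_proj"],   # q/k/v in all self-attention layers
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="t3-agent-lora",          # hypothetical output path
    num_train_epochs=5,                  # "We train 5 epochs over all data"
    per_device_train_batch_size=2,       # batch size 2
    learning_rate=1e-6,                  # lr 1e-6
    lr_scheduler_type="cosine",          # cosine annealing scheduler
    optim="adamw_torch",                 # AdamW optimizer
)
# Per the paper, the vision encoder and visual token compressor are
# frozen (e.g. module.requires_grad_(False)) before wrapping the
# language model with get_peft_model(model, lora_config).
```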