Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage

Authors: Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaojian Ma, Tao Yuan, Yue Fan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, Qing Li

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Evaluations on the GTA and GAIA benchmarks show that the T3-Agent consistently achieves improvements on two popular VLMs, MiniCPM-V-8.5B and Qwen2-VL-7B, outperforming untrained VLMs by 20% and demonstrating that the proposed data synthesis pipeline yields high-quality data for tool-usage capabilities.
Researcher Affiliation | Academia | (1) School of Intelligence Science and Technology, Peking University; (2) State Key Laboratory of General Artificial Intelligence, BIGAI; (3) Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology; (4) Guangdong Laboratory of Machine Perception and Intelligent Computing, Shenzhen MSU-BIT University; (5) Department of Automation, Tsinghua University
Pseudocode | No | The paper describes its methodology in natural language and illustrates agent interactions with 'Thought:', 'Code:', and 'Observation:' examples (Figures 1, 4, 5). However, it does not include a dedicated section or figure explicitly labeled 'Pseudocode' or 'Algorithm' that outlines the proposed method in a structured, code-like format.
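The 'Thought:'/'Code:'/'Observation:' interaction pattern described above can be sketched as a simple control loop. This is an illustrative reconstruction only, not the authors' implementation; the names `run_agent`, `controller`, and `execute` are hypothetical.

```python
# Hypothetical sketch of a Thought/Code/Observation agent loop, in the
# spirit of the paper's Figures 1, 4, and 5. Not the authors' code.

def run_agent(task, controller, execute, max_steps=5):
    """Iterate thought -> code -> observation until the controller stops."""
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        # The VLM controller reads the history and proposes the next step.
        thought, code = controller("\n".join(history))
        history.append(f"Thought: {thought}")
        if code is None:  # controller signals the task is complete
            break
        history.append(f"Code: {code}")
        # Run the tool-calling code and feed the result back as an observation.
        observation = execute(code)
        history.append(f"Observation: {observation}")
    return history
```

In practice `execute` would run the generated code in a sandbox with the tool set available; here it is left abstract.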
Open Source Code | Yes | mat-agent.github.io
Open Datasets | Yes | With the data generation pipeline, we construct MM-Traj, a dataset that contains 20K multi-modal tasks with tool-usage trajectories. ... we compile about 93K image-captioning pairs from 8 source datasets, including ChartQA (Masry et al., 2022), COCO (Lin et al., 2014), LLaVA (Wang et al., 2023a), SAM (Kirillov et al., 2023), TextVQA (Singh et al., 2019), Web-Celebrity (Liu et al., 2015), Web-Landmark (Weyand et al., 2020), and WikiArt (Saleh & Elgammal, 2015).
Dataset Splits | No | To preserve the visual perception and reasoning capabilities of MiniCPM-V and Qwen2-VL, we combine the training data in MM-Traj with the data in the Cauldron (Lindström & Abraham, 2022) and open-LLaVA-NeXT (Chen, 2024) datasets. We train 5 epochs over all data. ... The paper mentions using the validation set of the GAIA benchmark, but does not provide specific training/test/validation splits for its own constructed MM-Traj dataset or for the combined training data used for tuning.
Hardware Specification | Yes | Table 9 in Section B.2 (DATA NUMBER) provides hardware information: 'Memory 214 GB'.
Software Dependencies | No | The paper mentions Python packages such as 'matplotlib', 'opencv', 'pandas', and 'numpy', and models such as 'MiniCPM-V-8.5B' and 'Qwen2-VL-7B', but it does not specify version numbers for these software dependencies, which are required for a reproducible description.
Experiment Setup | Yes | To preserve the visual perception and reasoning capabilities of MiniCPM-V and Qwen2-VL, we combine the training data in MM-Traj with the data in the Cauldron (Lindström & Abraham, 2022) and open-LLaVA-NeXT (Chen, 2024) datasets. We train 5 epochs over all data. In the training process of our VLM controller, we freeze the vision encoder and visual token compressor, and fine-tune the language model using LoRA (Hu et al., 2022). We set the rank to 64 and apply LoRA to the query, key, and value projection matrices in all self-attention layers. We use the AdamW optimizer with a cosine annealing scheduler. The learning rate is 1e-6 and the batch size is 2. We set the max context window to 10240 to support the long trajectories of our agent.
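The reported setup can be summarized as a configuration sketch. This is a plain-Python restatement of the hyperparameters quoted above, not the authors' actual training code; the `target_modules` names are typical for Qwen2-style architectures and are an assumption, and `cosine_annealed_lr` is an illustrative helper showing the stated schedule.

```python
import math

# Hyperparameters as reported in the paper; dict keys are illustrative.
training_config = {
    "lora": {
        "rank": 64,
        # Assumed module names for query/key/value projections in all
        # self-attention layers; the real codebase may differ.
        "target_modules": ["q_proj", "k_proj", "v_proj"],
    },
    "frozen": ["vision_encoder", "visual_token_compressor"],
    "optimizer": "AdamW",
    "lr_schedule": "cosine_annealing",
    "learning_rate": 1e-6,
    "batch_size": 2,
    "max_context_window": 10240,
    "epochs": 5,
}

def cosine_annealed_lr(step, total_steps, base_lr=1e-6, min_lr=0.0):
    """Cosine annealing from base_lr at step 0 down to min_lr at total_steps."""
    progress = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The schedule starts at the reported learning rate of 1e-6 and decays smoothly to zero over training.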