Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage
Authors: Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaojian Ma, Tao Yuan, Yue Fan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, Qing Li
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluations on the GTA and GAIA benchmarks show that the T3-Agent consistently achieves improvements on two popular VLMs, MiniCPM-V-8.5B and Qwen2-VL-7B, outperforming the untrained VLMs by 20% — evidence that the proposed data synthesis pipeline yields high-quality data for tool-usage capabilities. |
| Researcher Affiliation | Academia | 1School of Intelligence Science and Technology, Peking University 2State Key Laboratory of General Artificial Intelligence, BIGAI 3Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology 4Guangdong Laboratory of Machine Perception and Intelligent Computing, Shenzhen MSU-BIT University 5Department of Automation, Tsinghua University |
| Pseudocode | No | The paper describes its methodology in natural language and illustrates agent interactions with 'Thought:', 'Code:', and 'Observation:' examples (Figures 1, 4, 5). However, it does not include a dedicated section or figure explicitly labeled 'Pseudocode' or 'Algorithm' that outlines the proposed method in a structured, code-like format. |
| Open Source Code | Yes | mat-agent.github.io |
| Open Datasets | Yes | With the data generation pipeline, we construct MM-Traj, a dataset that contains 20K multi-modal tasks with tool-usage trajectories. Based on the data synthesis pipeline, we collect the MM-Traj dataset that contains 20K tasks with trajectories of tool usage. ... we compile about 93K image-captioning pairs from 8 source datasets, including ChartQA (Masry et al., 2022), COCO (Lin et al., 2014), LLaVA (Wang et al., 2023a), SAM (Kirillov et al., 2023), TextVQA (Singh et al., 2019), Web-Celebrity (Liu et al., 2015), Web-Landmark (Weyand et al., 2020), and WikiArt (Saleh & Elgammal, 2015). |
| Dataset Splits | No | To preserve the visual perception and reasoning capabilities of MiniCPM-V and Qwen2-VL, we combine the training data in MM-Traj with the data in the Cauldron (Lindström & Abraham, 2022) and open-LLaVA-NeXT (Chen, 2024) datasets. We train 5 epochs over all data. ... The paper mentions using the validation set of the GAIA benchmark, but does not provide specific training/test/validation splits for its own constructed MM-Traj dataset or for the combined training data used for tuning. |
| Hardware Specification | Yes | Table 9 in section B.2 (DATA NUMBER) provides hardware information: 'Memory 214 GB'. |
| Software Dependencies | No | The paper mentions using Python packages such as 'matplotlib', 'opencv', 'pandas', and 'numpy', and models such as 'MiniCPM-V-8.5B' and 'Qwen2-VL-7B', but it does not specify version numbers for these software dependencies, which are required for reproducible descriptions. |
| Experiment Setup | Yes | To preserve the visual perception and reasoning capabilities of MiniCPM-V and Qwen2-VL, we combine the training data in MM-Traj with the data in the Cauldron (Lindström & Abraham, 2022) and open-LLaVA-NeXT (Chen, 2024) datasets. We train 5 epochs over all data. In the training process of our VLM controller, we freeze the vision encoder and visual token compressor, and fine-tune the language model using LoRA (Hu et al., 2022). We set the rank as 64 and apply LoRA on the query, key, and value projection matrices in all self-attention layers. We use the AdamW optimizer with a cosine annealing scheduler. The learning rate is 1e-6 and the batch size is 2. We set the max context window to 10240 to support the long trajectory of our agent. |
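The fine-tuning recipe quoted in the Experiment Setup row can be sketched with HuggingFace `peft` and `transformers`. This is a minimal sketch, not the authors' code: the `target_modules` names (`q_proj`, `k_proj`, `v_proj`), the `lora_alpha` value, and the output path are assumptions for a Qwen2-style backbone and are not stated in the paper.

```python
# Sketch of the reported tuning setup: LoRA (rank 64) on the q/k/v
# projection matrices, AdamW with a cosine annealing schedule,
# learning rate 1e-6, batch size 2, 5 epochs.
# NOTE: target_modules names and lora_alpha are assumptions, not
# values reported in the paper.
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=64,                                            # rank stated in the paper
    lora_alpha=128,                                  # assumed; not reported
    target_modules=["q_proj", "k_proj", "v_proj"],   # q/k/v in all self-attention layers
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="t3-agent-lora",          # hypothetical output path
    num_train_epochs=5,                  # "We train 5 epochs over all data"
    per_device_train_batch_size=2,       # batch size 2
    learning_rate=1e-6,                  # lr 1e-6
    lr_scheduler_type="cosine",          # cosine annealing scheduler
    optim="adamw_torch",                 # AdamW optimizer
)
# Per the paper, the vision encoder and visual token compressor are
# frozen (e.g. module.requires_grad_(False)) before wrapping the
# language model with get_peft_model(model, lora_config).
```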