ToolGen: Unified Tool Retrieval and Calling via Generation
Authors: Renxi Wang, Xudong Han, Lei Ji, Shu Wang, Timothy Baldwin, Haonan Li
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results with over 47,000 tools show that ToolGen not only achieves superior results in both tool retrieval and autonomous task completion but also sets the stage for a new era of AI agents that can adapt to tools across diverse domains. |
| Researcher Affiliation | Collaboration | 1LibrAI 2Mohamed bin Zayed University of Artificial Intelligence 3Microsoft 4University of California, Los Angeles 5The University of Melbourne |
| Pseudocode | Yes | Algorithm 1 Constrained Beam Search |
| Open Source Code | Yes | Data and code are available at https://github.com/Reason-Wang/ToolGen |
| Open Datasets | Yes | Our experiments are based on ToolBench, a real-world tool benchmark containing more than 16k tool collections, each containing several APIs, resulting in a total of 47k unique APIs. Each API is documented with a dictionary containing the name, description, and parameters for calling the API. A real example is shown in Appendix C. We take each API as an action and map it to a token. Our retrieval and end-to-end agent-tuning data are converted from the original data in ToolBench. Details can be found in Appendix K. Although each tool may consist of multiple APIs, for simplicity, we refer to each API as a tool in this paper. We follow the data split of Qin et al. (2023), where 200k (query, relevant API) pairs are divided into three categories: I1 (single-tool queries), I2 (intra-category multi-tool queries), and I3 (intra-collection multi-tool instructions), containing 87,413, 84,815, and 25,251 instances, respectively. |
| Dataset Splits | Yes | We follow the data split of Qin et al. (2023), where 200k (query, relevant API) pairs are divided into three categories: I1 (single-tool queries), I2 (intra-category multi-tool queries), and I3 (intra-collection multi-tool instructions), containing 87,413, 84,815, and 25,251 instances, respectively. |
| Hardware Specification | Yes | All models are trained using DeepSpeed ZeRO-3 (Rajbhandari et al., 2020) across 4 A100 GPUs. |
| Software Dependencies | No | The paper mentions 'Llama-3-8B' as the foundation model, 'DeepSpeed ZeRO-3' for training, and 'FlashAttention'. While specific software components are named, explicit version numbers for the key dependencies (e.g., Python, PyTorch, DeepSpeed, or FlashAttention versions) are not provided, so the reproducibility criteria are not met. |
| Experiment Setup | Yes | We fine-tune the model using the Llama-3 chat template with a cosine learning rate scheduler, applying 3% warm-up steps. The maximum learning rate is 4 × 10⁻⁵. All models are trained using DeepSpeed ZeRO-3 (Rajbhandari et al., 2020) across 4 A100 GPUs. We train 8 epochs for tool memorization and 1 epoch for retrieval training. Context length is truncated to 6,144. The total batch size is set to 512. |
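The constrained beam search cited in the Pseudocode row (Algorithm 1) can be sketched in miniature. This is a toy illustration, not the paper's implementation: the tool names, the prefix trie, and the stand-in `score` function are all hypothetical, where a real system would instead mask an LLM's logits so that only valid tool-token continuations survive.

```python
# Toy sketch of constrained beam search over tool-name tokens (hypothetical
# vocabulary; a real system would mask LLM logits instead of using `score`).

def build_trie(sequences):
    """Map each proper prefix of a valid sequence to its allowed next tokens."""
    trie = {}
    for seq in sequences:
        for i in range(len(seq)):
            trie.setdefault(tuple(seq[:i]), set()).add(seq[i])
    return trie

def constrained_beam_search(score, trie, beam_width=2, max_steps=8):
    """Beam search that only expands tokens the trie allows at each prefix."""
    beams = [((), 0.0)]            # (token sequence, cumulative log-prob)
    finished = []
    for _ in range(max_steps):
        candidates = []
        for seq, logp in beams:
            allowed = trie.get(seq)
            if not allowed:        # no continuation: a complete tool name
                finished.append((seq, logp))
                continue
            for tok in allowed:
                candidates.append((seq + (tok,), logp + score(seq, tok)))
        if not candidates:
            break
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    finished.extend(b for b in beams if b not in finished)
    return max(finished, key=lambda f: f[1])[0]

# Hypothetical tool vocabulary and a stand-in log-prob table.
tools = [("get", "weather"), ("get", "stock", "price"), ("send", "email")]
logp = {"get": -0.1, "stock": -0.3, "weather": -0.5,
        "price": -0.1, "send": -0.4, "email": -0.2}
best = constrained_beam_search(lambda seq, tok: logp[tok], build_trie(tools))
print(best)  # ('get', 'stock', 'price')
```

Because every candidate expansion is drawn from the trie, the decoder can only ever emit a token sequence that spells out a real tool name, which is the property the paper's generation-as-retrieval setup relies on.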
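The learning-rate schedule quoted in the Experiment Setup row (cosine scheduler, 3% warm-up, peak 4 × 10⁻⁵) can be sketched as follows. The linear warm-up shape and the decay-to-zero floor are assumptions on my part; the quoted setup only names the scheduler type, the warm-up fraction, and the peak rate.

```python
import math

def lr_at(step, total_steps, peak_lr=4e-5, warmup_frac=0.03):
    """Learning rate at a given step: linear warm-up, then cosine decay.

    Warm-up shape and zero floor are assumed; the paper only states a
    cosine scheduler, 3% warm-up steps, and a 4e-5 maximum rate.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps              # ramp to peak
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))   # decay to ~0
```

For example, with 1,000 total steps the first 30 steps ramp linearly up to 4e-5, after which the rate follows a half-cosine down toward zero by the final step.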