ToolGen: Unified Tool Retrieval and Calling via Generation
Authors: Renxi Wang, Xudong Han, Lei Ji, Shu Wang, Timothy Baldwin, Haonan Li
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results with over 47,000 tools show that ToolGen not only achieves superior results in both tool retrieval and autonomous task completion but also sets the stage for a new era of AI agents that can adapt to tools across diverse domains. |
| Researcher Affiliation | Collaboration | 1LibrAI 2Mohamed bin Zayed University of Artificial Intelligence 3Microsoft 4University of California, Los Angeles 5The University of Melbourne |
| Pseudocode | Yes | Algorithm 1 Constrained Beam Search |
| Open Source Code | Yes | Data and code are available at https://github.com/Reason-Wang/ToolGen |
| Open Datasets | Yes | Our experiments are based on ToolBench, a real-world tool benchmark containing more than 16k tool collections, each containing several APIs, resulting in a total of 47k unique APIs. Each API is documented with a dictionary containing the name, description, and parameters for calling the API. A real example is shown in Appendix C. We take each API as an action and map it to a token. Our retrieval and end-to-end agent-tuning data are converted from the original data in ToolBench. Details can be found in Appendix K. Although each tool may consist of multiple APIs, for simplicity, we refer to each API as a tool in this paper. We follow the data split of Qin et al. (2023), where 200k (query, relevant API) pairs are divided into three categories: I1 (single-tool queries), I2 (intra-category multi-tool queries), and I3 (intra-collection multi-tool instructions), containing 87,413, 84,815, and 25,251 instances, respectively. |
| Dataset Splits | Yes | We follow the data split of Qin et al. (2023), where 200k (query, relevant API) pairs are divided into three categories: I1 (single-tool queries), I2 (intra-category multi-tool queries), and I3 (intra-collection multi-tool instructions), containing 87,413, 84,815, and 25,251 instances, respectively. |
| Hardware Specification | Yes | All models are trained using DeepSpeed ZeRO-3 (Rajbhandari et al., 2020) across 4 A100 GPUs. |
| Software Dependencies | No | The paper mentions 'Llama-3-8B' as the foundation model, 'DeepSpeed ZeRO-3' for training, and 'FlashAttention'. While specific software components are named, explicit version numbers for the key dependencies (e.g., Python, PyTorch, DeepSpeed, or FlashAttention versions) are not provided, so the reproducibility criteria are not met. |
| Experiment Setup | Yes | We fine-tune the model using the Llama-3 chat template with a cosine learning rate scheduler, applying 3% warm-up steps. The maximum learning rate is 4 × 10⁻⁵. All models are trained using DeepSpeed ZeRO-3 (Rajbhandari et al., 2020) across 4 A100 GPUs. We train 8 epochs for tool memorization and 1 epoch for retrieval training. Context length is truncated to 6,144. The total batch size is set to 512. |
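The constrained beam search cited in the Pseudocode row (Algorithm 1) can be sketched in miniature. This is a toy illustration, not the paper's implementation: the tool names, the prefix trie, and the stand-in `score` function are all hypothetical, where a real system would instead mask an LLM's logits so that only valid tool-token continuations survive.

```python
# Toy sketch of constrained beam search over tool-name tokens (hypothetical
# vocabulary; a real system would mask LLM logits instead of using `score`).

def build_trie(sequences):
    """Map each proper prefix of a valid sequence to its allowed next tokens."""
    trie = {}
    for seq in sequences:
        for i in range(len(seq)):
            trie.setdefault(tuple(seq[:i]), set()).add(seq[i])
    return trie

def constrained_beam_search(score, trie, beam_width=2, max_steps=8):
    """Beam search that only expands tokens the trie allows at each prefix."""
    beams = [((), 0.0)]            # (token sequence, cumulative log-prob)
    finished = []
    for _ in range(max_steps):
        candidates = []
        for seq, logp in beams:
            allowed = trie.get(seq)
            if not allowed:        # no continuation: a complete tool name
                finished.append((seq, logp))
                continue
            for tok in allowed:
                candidates.append((seq + (tok,), logp + score(seq, tok)))
        if not candidates:
            break
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    finished.extend(b for b in beams if b not in finished)
    return max(finished, key=lambda f: f[1])[0]

# Hypothetical tool vocabulary and a stand-in log-prob table.
tools = [("get", "weather"), ("get", "stock", "price"), ("send", "email")]
logp = {"get": -0.1, "stock": -0.3, "weather": -0.5,
        "price": -0.1, "send": -0.4, "email": -0.2}
best = constrained_beam_search(lambda seq, tok: logp[tok], build_trie(tools))
print(best)  # ('get', 'stock', 'price')
```

Because every candidate expansion is drawn from the trie, the decoder can only ever emit a token sequence that spells out a real tool name, which is the property the paper's generation-as-retrieval setup relies on.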
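The learning-rate schedule quoted in the Experiment Setup row (cosine scheduler, 3% warm-up, peak 4 × 10⁻⁵) can be sketched as follows. The linear warm-up shape and the decay-to-zero floor are assumptions on my part; the quoted setup only names the scheduler type, the warm-up fraction, and the peak rate.

```python
import math

def lr_at(step, total_steps, peak_lr=4e-5, warmup_frac=0.03):
    """Learning rate at a given step: linear warm-up, then cosine decay.

    Warm-up shape and zero floor are assumed; the paper only states a
    cosine scheduler, 3% warm-up steps, and a 4e-5 maximum rate.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps              # ramp to peak
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))   # decay to ~0
```

For example, with 1,000 total steps the first 30 steps ramp linearly up to 4e-5, after which the rate follows a half-cosine down toward zero by the final step.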