ToolACE: Winning the Points of LLM Function Calling
Authors: Weiwen Liu, Xu Huang, Xingshan Zeng, Xinlong Hao, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Zhengying Liu, Yuanqing Yu, Zezhong Wang, Yuxian Wang, Wu Ning, Yutai Hou, Bin Wang, Chuhan Wu, Wang Xinzhi, Yong Liu, Yasheng Wang, Duyu Tang, Dandan Tu, Lifeng Shang, Xin Jiang, Ruiming Tang, Defu Lian, Qun Liu, Enhong Chen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on two widely adopted benchmarks: BFCL Yan et al. (2024) and API-Bank Li et al. (2023). With only 8B parameters, ToolACE significantly outperforms existing open-source LLMs and is competitive with the latest GPT-4 models. Our model and a subset of the data are publicly available at https://huggingface.co/Team-ACE. |
| Researcher Affiliation | Collaboration | 1Shanghai Jiao Tong University 2Huawei Noah's Ark Lab 3University of Science and Technology of China 4Huawei Technologies Co., Ltd 5Tsinghua University 6The Chinese University of Hong Kong |
| Pseudocode | No | The paper describes methods like Tool Self-evolution Synthesis (TSS), Self-Guided Dialog Generation (SDG), and Dual-Layer Validation Process (DLV) in prose, and illustrates them with architectural diagrams (Figure 1), and examples of rules (Table 4) and case studies (Figures 10-16). However, it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps formatted like code for any of its procedures. |
| Open Source Code | Yes | Our model and a subset of the data are publicly available at https://huggingface.co/Team-ACE. |
| Open Datasets | Yes | We conduct experiments on two widely adopted benchmarks: BFCL Yan et al. (2024) and API-Bank Li et al. (2023). The two benchmarks are comprehensive and executable function-calling evaluations specifically designed to assess the ability of LLMs to invoke functions. |
| Dataset Splits | Yes | To effectively assess the impact of dataset complexity on the model's performance, we have conducted a sampling of the entire dataset based on the aforementioned complexity assessment metrics. We compute and sort the complexity for each data sample using Eq. (1), and select the bottom, middle, and top 60,000 instances as ToolACE-easy, ToolACE-medium, and ToolACE-hard, respectively, yielding three distinct subsets of varying complexity levels... Approximately 30,000 instances are randomly selected from each subset, resulting in three training sets with distinct levels of diversity. BFCL contains 4,951 test cases: 3,951 single-turn cases and 1,000 multi-turn cases. |
| Hardware Specification | No | The paper does not explicitly mention any specific hardware (e.g., GPU models, CPU types, memory, or cloud instances) used for running the experiments. It refers to training LLMs like LLaMA3.1-8B-Instruct and the Qwen-1.5-xB-Chat series, but without detailing the underlying hardware. |
| Software Dependencies | No | The paper mentions fine-tuning using LoRA and various LLM backbones such as LLaMA3.1-8B-Instruct and the Qwen-1.5-xB-Chat series. However, it does not provide specific versions for any ancillary software dependencies like programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or other libraries (e.g., CUDA) that would be needed to replicate the experiments. |
| Experiment Setup | Yes | Table 5: Hyper-parameters in experiments for training. Learning Rate: 10^-4; Warm-Up Ratio: 0.1; LR Scheduler: cosine; Batch Size: 48; Epochs: 3; LoRA rank: 16; LoRA alpha: 32 |
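The complexity-based split quoted under "Dataset Splits" (sort by the Eq. (1) complexity score, then take the bottom, middle, and top 60,000 instances) can be sketched as follows. This is a hypothetical reconstruction, not the authors' code: `split_by_complexity` and the `(sample, complexity)` pair format are illustrative assumptions, and the complexity metric itself is not reproduced here.

```python
# Hypothetical sketch of the complexity-based subset selection.
# `samples` is a list of (sample, complexity) pairs, where the complexity
# score is assumed to be precomputed per Eq. (1) of the paper.

def split_by_complexity(samples, subset_size=60_000):
    """Sort samples by complexity and slice out the bottom (easy),
    middle (medium), and top (hard) `subset_size` instances."""
    ranked = sorted(samples, key=lambda pair: pair[1])
    n = len(ranked)
    easy = ranked[:subset_size]
    mid_start = (n - subset_size) // 2
    medium = ranked[mid_start:mid_start + subset_size]
    hard = ranked[-subset_size:]
    return easy, medium, hard
```

The three slices correspond to ToolACE-easy, ToolACE-medium, and ToolACE-hard; the paper then randomly draws about 30,000 instances from each for the diversity study.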
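The Table 5 hyper-parameters specify a learning rate of 10^-4 with a 0.1 warm-up ratio and a cosine scheduler. A minimal sketch of such a schedule is below; the paper does not specify its training framework or exact schedule implementation, so `lr_at` and its linear-warmup/cosine-decay shape are assumptions for illustration.

```python
import math

# Table 5 values: learning rate 1e-4, warm-up ratio 0.1, cosine LR scheduler.
LEARNING_RATE = 1e-4
WARMUP_RATIO = 0.1

def lr_at(step, total_steps, base_lr=LEARNING_RATE, warmup_ratio=WARMUP_RATIO):
    """Learning rate at a given step: linear warm-up over the first
    `warmup_ratio` fraction of steps, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(warmup_steps, 1)
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

The remaining Table 5 values (batch size 48, 3 epochs, LoRA rank 16, LoRA alpha 32) are training-run and adapter settings rather than schedule parameters, so they do not appear in this sketch.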