ToolACE: Winning the Points of LLM Function Calling

Authors: Weiwen Liu, Xu Huang, Xingshan Zeng, Xinlong Hao, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Zhengying Liu, Yuanqing Yu, Zezhong Wang, Yuxian Wang, Wu Ning, Yutai Hou, Bin Wang, Chuhan Wu, Xinzhi Wang, Yong Liu, Yasheng Wang, Duyu Tang, Dandan Tu, Lifeng Shang, Xin Jiang, Ruiming Tang, Defu Lian, Qun Liu, Enhong Chen

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on two widely adopted benchmarks: BFCL (Yan et al., 2024) and API-Bank (Li et al., 2023). With only 8B parameters, ToolACE significantly outperforms existing open-source LLMs and is competitive with the latest GPT-4 models. Our model and a subset of the data are publicly available at https://huggingface.co/Team-ACE.
Researcher Affiliation | Collaboration | 1 Shanghai Jiao Tong University; 2 Huawei Noah's Ark Lab; 3 University of Science and Technology of China; 4 Huawei Technologies Co., Ltd; 5 Tsinghua University; 6 The Chinese University of Hong Kong
Pseudocode | No | The paper describes methods such as Tool Self-evolution Synthesis (TSS), Self-Guided Dialog Generation (SDG), and the Dual-Layer Validation Process (DLV) in prose, illustrating them with an architectural diagram (Figure 1), example rules (Table 4), and case studies (Figures 10-16). However, it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured, code-like steps for any of its procedures.
Open Source Code | Yes | Our model and a subset of the data are publicly available at https://huggingface.co/Team-ACE.
Open Datasets | Yes | We conduct experiments on two widely adopted benchmarks: BFCL (Yan et al., 2024) and API-Bank (Li et al., 2023). The two benchmarks are comprehensive and executable function call evaluations specifically designed to assess the ability of LLMs to invoke functions.
Dataset Splits | Yes | To effectively assess the impact of dataset complexity on the model's performance, we have conducted a sampling of the entire dataset based on the aforementioned complexity assessment metrics. We compute and sort the complexity for each data sample using Eq. (1), and select the bottom, middle, and top 60,000 instances as ToolACE-easy, ToolACE-medium, and ToolACE-hard, respectively, yielding three distinct subsets of varying complexity levels... Approximately 30,000 instances are randomly selected from each subset, resulting in three training sets with distinct levels of diversity. BFCL contains 4,951 test cases: 3,951 single-turn cases and 1,000 multi-turn cases.
Hardware Specification | No | The paper does not explicitly mention any specific hardware (e.g., GPU models, CPU types, memory, or cloud instances) used for running the experiments. It refers to training LLMs such as LLaMA3.1-8B-Instruct and the Qwen-1.5-xB-Chat series, but without detailing the underlying hardware.
Software Dependencies | No | The paper mentions fine-tuning using LoRA and various LLM backbones such as LLaMA3.1-8B-Instruct and the Qwen-1.5-xB-Chat series. However, it does not provide specific versions for any ancillary software dependencies, such as the programming language (e.g., Python), deep-learning frameworks (e.g., PyTorch, TensorFlow), or other libraries (e.g., CUDA), that would be needed to replicate the experiments.
Experiment Setup | Yes | Table 5: Hyper-parameters in experiments for training. Learning Rate: 10^-4; Warm-up Ratio: 0.1; LR Scheduler: cosine; Batch Size: 48; Epochs: 3; LoRA rank: 16; LoRA alpha: 32.
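
For the Open Source Code row, a minimal loading sketch is given below. The quoted text only provides the organization URL (https://huggingface.co/Team-ACE); the repository id `Team-ACE/ToolACE-8B` used here is a hypothetical placeholder, and a standard Transformers-compatible checkpoint is assumed.

```python
# Minimal sketch of loading the released checkpoint from the Hugging Face Hub.
# NOTE: "Team-ACE/ToolACE-8B" is a hypothetical repo id; check the Team-ACE
# organization page (https://huggingface.co/Team-ACE) for the actual name.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "Team-ACE/ToolACE-8B"  # hypothetical placeholder
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)
```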
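
The complexity-based split quoted under Dataset Splits can be pictured with the sketch below. It assumes per-sample complexity scores from the paper's Eq. (1) are already computed; how the "middle" 60,000 instances are delimited and how the ~30,000 training instances are drawn are not specified in the quoted text, so this is only a plausible reconstruction.

```python
import random

def complexity_splits(samples, complexity, subset_size=60_000, train_size=30_000, seed=0):
    """Sort samples by a precomputed complexity score (the paper's Eq. (1)),
    take the bottom/middle/top `subset_size` instances as the easy/medium/hard
    subsets, then randomly sample `train_size` instances from each subset."""
    order = sorted(range(len(samples)), key=lambda i: complexity[i])
    easy = [samples[i] for i in order[:subset_size]]                   # bottom 60k -> ToolACE-easy
    mid_lo = max(0, len(order) // 2 - subset_size // 2)                # assumed centering of the "middle" slice
    medium = [samples[i] for i in order[mid_lo:mid_lo + subset_size]]  # middle 60k -> ToolACE-medium
    hard = [samples[i] for i in order[-subset_size:]]                  # top 60k -> ToolACE-hard
    rng = random.Random(seed)
    return {name: rng.sample(subset, min(train_size, len(subset)))
            for name, subset in (("easy", easy), ("medium", medium), ("hard", hard))}
```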
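
The Table 5 values listed under Experiment Setup map onto a standard LoRA fine-tuning configuration. The sketch below assumes the Hugging Face transformers/peft stack and a LLaMA-3.1-8B-Instruct backbone; the paper does not state which training framework was used, and the split of the global batch size of 48 across devices is likewise an assumption.

```python
# Illustrative LoRA fine-tuning configuration reflecting Table 5 (framework choice assumed).
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))  # LoRA rank 16, alpha 32

training_args = TrainingArguments(
    output_dir="toolace-lora",
    learning_rate=1e-4,             # Table 5: Learning Rate 10^-4
    warmup_ratio=0.1,               # Table 5: Warm-up Ratio
    lr_scheduler_type="cosine",     # Table 5: LR Scheduler
    num_train_epochs=3,             # Table 5: Epochs
    per_device_train_batch_size=6,  # assumed: global batch size 48 spread over 8 devices
    gradient_accumulation_steps=1,
)
```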