MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models
Authors: Pei Wang, Yanan Wu, Zekun Wang, Jiaheng Liu, Xiaoshuai Song, Z.Y. Peng, Ken Deng, Chenchen Zhang, Jiakai Wang, Junran Peng, Ge Zhang, Hangyu Guo, Zhaoxiang Zhang, Wenbo Su, Bo Zheng
ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experimental results demonstrate the effectiveness of our MTU-Bench. Code and data will be released at https://github.com/MTU-Bench-Team/MTU-Bench.git. |
| Researcher Affiliation | Collaboration | 1Alibaba Group, 2Nanjing University, 3University of Chinese Academy of Sciences, 4University of Waterloo |
| Pseudocode | No | The paper provides several "Prompt Template" and "System Prompt" blocks describing instructions given to LLMs. While structured, these are prompts for generating data or evaluating models, not pseudocode or algorithms describing the methodology's computational steps in algorithmic form. |
| Open Source Code | Yes | Code and data will be released at https://github.com/MTU-Bench-Team/MTU-Bench.git. |
| Open Datasets | Yes | To improve the diversity of our dataset, we collect several open-source task-oriented dialogue datasets as our data sources. These datasets focus on dialogues for specific tasks such as flight reservations or movie bookings, which are highly suitable for synthesizing tool-use data. The multi-turn dialogue datasets include MultiWOZ (Budzianowski et al., 2018), SGD (Rastogi et al., 2020b), Taskmaster (Byrne et al., 2019) and MetaLWOZ (Shalyminov et al., 2020). The single-turn dialogue datasets include ATIS (Hemphill et al., 1990) and SNIPS (Siddhant et al., 2018). |
| Dataset Splits | Yes | we split the MTU-Bench data into training and testing splits, involving 54798 dialogues in total, as well as 136 tools. In our MTU-Eval, we propose a series of fine-grained metrics... we pick out a hard subset from the test split to include more complex tool-use scenarios... Figure 4: Statistics of MTU-Bench. #Dialogues (Train/Test): 54,367 / 431. Table 7 (dialogues per setting, Train / normal / hard): S-S 14,277 / 52 / 56; S-M 13,641 / 55 / 39; M-S 19,007 / 54 / 31; M-M 7,442 / 42 / 37. |
| Hardware Specification | No | The paper lists various LLMs used for evaluation (e.g., GPT-4, LLaMA3-8B), but it does not specify any hardware details (such as GPU models, CPU types, or memory) on which the experiments were conducted or on which fine-tuning of the authors' own model, MTU-LLaMA, was performed. |
| Software Dependencies | No | The paper mentions various large language models (LLMs) such as GPT-4, LLaMA3, Qwen-Max, etc., as well as the fine-tuning of MTU-LLaMA based on LLaMA3-8B-Instruct. However, it does not specify any programming languages, libraries, or other software components with their respective version numbers that would be necessary to replicate their methodology or benchmark setup. |
| Experiment Setup | No | The paper states that "MTU-LLaMA, which is fine-tuned on MTU-Instruct based on LLaMA3-8B-Instruct." This indicates a fine-tuning process but does not provide specific experimental setup details such as learning rates, batch sizes, number of epochs, or optimizer configurations, which are crucial for reproducibility. |