MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models
Authors: Pei Wang, Yanan Wu, Zekun Wang, Jiaheng Liu, Xiaoshuai Song, Z.Y. Peng, Ken Deng, Chenchen Zhang, Jiakai Wang, Junran Peng, Ge Zhang, Hangyu Guo, Zhaoxiang Zhang, Wenbo Su, Bo Zheng
ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experimental results demonstrate the effectiveness of our MTU-Bench. Code and data will be released at https://github.com/MTU-Bench-Team/MTU-Bench.git. |
| Researcher Affiliation | Collaboration | 1Alibaba Group, 2Nanjing University, 3University of Chinese Academy of Sciences, 4University of Waterloo |
| Pseudocode | No | The paper provides several "Prompt Template" and "System Prompt" blocks describing instructions given to LLMs. While structured, these are prompts for generating data or evaluating models, not pseudocode or algorithms describing the methodology's computational steps in algorithmic form. |
| Open Source Code | Yes | Code and data will be released at https://github.com/MTU-Bench-Team/MTU-Bench.git. |
| Open Datasets | Yes | To improve the diversity of our dataset, we collect several open-source task-oriented dialogue datasets as our data sources. These datasets focus on dialogues for specific tasks such as flight reservations or movie bookings, which are highly suitable for synthesizing tool-use data. The multi-turn dialogue datasets include MultiWOZ (Budzianowski et al., 2018), SGD (Rastogi et al., 2020b), Taskmaster (Byrne et al., 2019) and MetaLWOZ (Shalyminov et al., 2020). The single-turn dialogue datasets include ATIS (Hemphill et al., 1990) and SNIPS (Siddhant et al., 2018). |
| Dataset Splits | Yes | we split the MTU-Bench data into training and testing splits, involving 54798 dialogues in total, as well as 136 tools. In our MTU-Eval, we propose a series of fine-grained metrics... we pick out a hard subset from the test split to include more complex tool-use scenarios... Figure 4: Statistics of MTU-Bench. #Dialogues (Train/Test): 54,367 / 431. Table 7 (dialogues per setting, Train / normal / hard): S-S 14,277 / 52 / 56; S-M 13,641 / 55 / 39; M-S 19,007 / 54 / 31; M-M 7,442 / 42 / 37. |
| Hardware Specification | No | The paper lists various LLMs used for evaluation (e.g., GPT-4, LLaMA3-8B), but it does not specify any hardware details (such as GPU models, CPU types, or memory) on which the experiments were conducted or on which fine-tuning of the authors' own model, MTU-LLaMA, was performed. |
| Software Dependencies | No | The paper mentions various large language models (LLMs) such as GPT-4, LLaMA3, Qwen-Max, etc., as well as the fine-tuning of MTU-LLaMA based on LLaMA3-8B-Instruct. However, it does not specify any programming languages, libraries, or other software components with their respective version numbers that would be necessary to replicate their methodology or benchmark setup. |
| Experiment Setup | No | The paper states that "MTU-LLaMA, which is fine-tuned on MTU-Instruct based on LLaMA3-8B-Instruct." This indicates a fine-tuning process but does not provide specific experimental setup details such as learning rates, batch sizes, number of epochs, or optimizer configurations, which are crucial for reproducibility. |