Facilitating Multi-turn Function Calling for LLMs via Compositional Instruction Tuning

Authors: Mingyang Chen, Haoze Sun, Tianpeng Li, Fan Yang, Hao Liang, Keer Lu, Bin Cui, Wentao Zhang, Zenan Zhou, Weipeng Chen

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type | Experimental | "We produce a dataset BUTTONInstruct comprising 8k data points and demonstrate its effectiveness through extensive experiments across various LLMs." "To evaluate the effectiveness of our multi-turn function calling data BUTTONInstruct collected via our proposed BUTTON pipeline, we train two series of open-source LLMs of different sizes: Llama3-8B, Llama3-70B (Dubey et al., 2024), Qwen2-7B, and Qwen2-72B (Yang et al., 2024a)."
Researcher Affiliation | Collaboration | 1 Baichuan Inc., 2 Peking University
Pseudocode | No | The paper describes the 'BOTTOM-UP THEN TOP-DOWN' pipeline, denoted as BUTTON, and details its stages ('bottom-up instruction construction' and 'top-down trajectory generation') through narrative text and figures, but does not present any formal pseudocode or algorithm blocks.
Open Source Code | No | "The data is available at https://github.com/PKU-Baichuan-MLSystemLab/BUTTON." Explanation: The paper explicitly states that the data is available at the provided link, but it does not make an unambiguous statement that the code for the methodology (the BUTTON pipeline) is open source or available.
Open Datasets | Yes | "We produce a dataset BUTTONInstruct comprising 8k data points and demonstrate its effectiveness through extensive experiments across various LLMs." "The data is available at https://github.com/PKU-Baichuan-MLSystemLab/BUTTON." "In our work, the seed data for scenario extraction is derived from glaive-function-calling-v2 (glaiveai, 2023) and ToolLLaMA datasets (Qin et al., 2023)."
Dataset Splits | Yes | "Benchmarks. We evaluate performance using two benchmarks, GTA and Tool-Query. GTA (Wang et al., 2024b), a benchmark for General Tool Agents, consists of 229 human-crafted queries... The number of test samples excluding these questions is 209. Tool-Query (Ma et al., 2024) is a tool-using environment... It consists of 60 tasks... Tasks are also labeled as hard or easy based on the number of subgoals..."
Hardware Specification | Yes | "The models are trained on 4 × 8 NVIDIA H800 GPUs."
Software Dependencies | No | "All instruction-tuning training is performed on 4 × 8 NVIDIA H800 GPUs, using the training framework based on Hugging Face Transformers (Wolf et al., 2019)." Explanation: The paper mentions Hugging Face Transformers but does not provide specific version numbers for it or for any other key software components, such as Python or PyTorch, which would be necessary for reproduction.
Experiment Setup | Yes | "During model training, we optimize the loss only on the response content from assistant roles. We use a learning rate of 2e-5 with cosine decay and a batch size of 64 for all models. For Llama3-8B and Qwen2-7B, we train for five epochs, and for Llama3-70B and Qwen2-72B, we train for two epochs."
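The "loss only on the response content from assistant roles" setup quoted above is commonly implemented by masking non-assistant token positions out of the label sequence. The sketch below is illustrative only, not the authors' code: `build_labels`, the role spans, and the toy token IDs are hypothetical, while `-100` is the standard ignore index used by PyTorch's `CrossEntropyLoss` and Hugging Face trainers.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss
                     # (PyTorch CrossEntropyLoss ignore_index convention)

def build_labels(token_ids, role_spans):
    """Copy token_ids into labels, keeping loss only on assistant spans.

    role_spans: list of (role, start, end) tuples covering the tokenized
    conversation, with `end` exclusive. Every token outside an assistant
    span is masked with IGNORE_INDEX, so only assistant responses
    contribute to the training loss.
    """
    labels = [IGNORE_INDEX] * len(token_ids)
    for role, start, end in role_spans:
        if role == "assistant":
            labels[start:end] = token_ids[start:end]
    return labels

# Toy 10-token conversation: tokens 0-2 are the system prompt,
# 3-5 the user turn, and 6-9 the assistant response.
tokens = list(range(100, 110))
spans = [("system", 0, 3), ("user", 3, 6), ("assistant", 6, 10)]
labels = build_labels(tokens, spans)
# labels == [-100, -100, -100, -100, -100, -100, 106, 107, 108, 109]
```

In a multi-turn function-calling trajectory the same masking applies per turn, so tool-call results and user messages interleaved between assistant responses are excluded from the loss as well.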