Facilitating Multi-turn Function Calling for LLMs via Compositional Instruction Tuning
Authors: Mingyang Chen, Haoze Sun, Tianpeng Li, Fan Yang, Hao Liang, Keer Lu, Bin Cui, Wentao Zhang, Zenan Zhou, Weipeng Chen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We produce a dataset BUTTONInstruct comprising 8k data points and demonstrate its effectiveness through extensive experiments across various LLMs. To evaluate the effectiveness of our multi-turn function calling data BUTTONInstruct collected via our proposed BUTTON pipeline, we train two series of open-source LLMs of different sizes: Llama3-8B, Llama3-70B (Dubey et al., 2024), Qwen2-7B, and Qwen2-72B (Yang et al., 2024a). |
| Researcher Affiliation | Collaboration | 1Baichuan Inc., 2Peking University |
| Pseudocode | No | The paper describes the 'BOTTOM-UP THEN TOP-DOWN pipeline, denoted as BUTTON' and details its stages ('bottom-up instruction construction' and 'top-down trajectory generation') through narrative text and figures, but does not present any formal pseudocode or algorithm blocks. |
| Open Source Code | No | The data is available at https://github.com/PKU-Baichuan-MLSystemLab/BUTTON. Explanation: The paper explicitly states that 'The data is available' at the provided link, but it does not make an unambiguous statement that the *code for the methodology* (the BUTTON pipeline) is open-source or available. |
| Open Datasets | Yes | We produce a dataset BUTTONInstruct comprising 8k data points and demonstrate its effectiveness through extensive experiments across various LLMs. The data is available at https://github.com/PKU-Baichuan-MLSystemLab/BUTTON. In our work, the seed data for scenario extraction is derived from glaive-function-calling-v2 (glaiveai, 2023) and ToolLLaMA datasets (Qin et al., 2023). |
| Dataset Splits | Yes | Benchmarks. We evaluate performance using two benchmarks, GTA and Tool-Query. GTA (Wang et al., 2024b), a benchmark for General Tool Agents, consists of 229 human-crafted queries... The number of test samples excluding these questions is 209. Tool-Query (Ma et al., 2024) is a tool-using environment... It consists of 60 tasks... Tasks are also labeled as hard or easy based on the number of subgoals... |
| Hardware Specification | Yes | The models are trained on 4 × 8 NVIDIA H800 GPUs. |
| Software Dependencies | No | All instruction-tuning training is performed on 4 × 8 NVIDIA H800 GPUs, using a training framework based on Hugging Face Transformers (Wolf et al., 2019). Explanation: The paper mentions 'Hugging Face Transformers' but does not provide specific version numbers for it or any other key software components like Python or PyTorch, which would be necessary for reproduction. |
| Experiment Setup | Yes | During model training, we optimize the loss only on the response content from assistant roles. We use a learning rate of 2e-5 with cosine decay and a batch size of 64 for all models. For Llama3-8B and Qwen2-7B, we train for five epochs, and for Llama3-70B and Qwen2-72B, we train for two epochs. |
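The hyperparameters quoted in the Experiment Setup and Hardware rows can be collected into a small configuration sketch. Since the repository does not ship a training script, the field names below are illustrative, and the per-device batch size is an assumption derived from the reported global batch of 64 spread over 4 × 8 GPUs:

```python
# Illustrative summary of the training setup reported in the paper.
# Field names are hypothetical; only the values come from the quoted text.
train_config = {
    "learning_rate": 2e-5,                 # reported learning rate
    "lr_scheduler_type": "cosine",         # cosine decay, as stated
    "global_batch_size": 64,               # reported for all models
    "epochs": {                            # reported per model size
        "Llama3-8B": 5, "Qwen2-7B": 5,
        "Llama3-70B": 2, "Qwen2-72B": 2,
    },
    "loss": "assistant responses only",    # loss masked to assistant turns
}

# With 4 nodes x 8 H800 GPUs = 32 devices, a global batch of 64 implies a
# per-device batch of 2 (assumption: no gradient accumulation is mentioned).
world_size = 4 * 8
per_device_batch = train_config["global_batch_size"] // world_size
```

Note that the paper does not report warmup steps, weight decay, or sequence length, so those would have to be chosen independently by anyone attempting a reproduction.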