Facilitating Multi-turn Function Calling for LLMs via Compositional Instruction Tuning

Authors: Mingyang Chen, Haoze Sun, Tianpeng Li, Fan Yang, Hao Liang, Keer Lu, Bin Cui, Wentao Zhang, Zenan Zhou, Weipeng Chen

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type | Experimental | "We produce a dataset BUTTONInstruct comprising 8k data points and demonstrate its effectiveness through extensive experiments across various LLMs." "To evaluate the effectiveness of our multi-turn function calling data BUTTONInstruct collected via our proposed BUTTON pipeline, we train two series of open-source LLMs of different sizes: Llama3-8B, Llama3-70B (Dubey et al., 2024), Qwen2-7B, and Qwen2-72B (Yang et al., 2024a)."
Researcher Affiliation | Collaboration | 1 Baichuan Inc., 2 Peking University
Pseudocode | No | The paper describes the 'BOTTOM-UP THEN TOP-DOWN' pipeline, denoted as BUTTON, and details its stages ('bottom-up instruction construction' and 'top-down trajectory generation') through narrative text and figures, but does not present any formal pseudocode or algorithm blocks.
Open Source Code | No | "The data is available at https://github.com/PKU-Baichuan-MLSystemLab/BUTTON." Explanation: The paper explicitly states that the data is available at the provided link, but it does not make an unambiguous statement that the code for the methodology (the BUTTON pipeline) is open source or available.
Open Datasets | Yes | "We produce a dataset BUTTONInstruct comprising 8k data points and demonstrate its effectiveness through extensive experiments across various LLMs." "The data is available at https://github.com/PKU-Baichuan-MLSystemLab/BUTTON." "In our work, the seed data for scenario extraction is derived from glaive-function-calling-v2 (glaiveai, 2023) and ToolLLaMA datasets (Qin et al., 2023)."
Dataset Splits | Yes | "Benchmarks. We evaluate performance using two benchmarks, GTA and Tool-Query. GTA (Wang et al., 2024b), a benchmark for General Tool Agents, consists of 229 human-crafted queries... The number of test samples excluding these questions is 209. Tool-Query (Ma et al., 2024) is a tool-using environment... It consists of 60 tasks... Tasks are also labeled as hard or easy based on the number of subgoals..."
Hardware Specification | Yes | "The models are trained on 4 × 8 NVIDIA H800 GPUs."
Software Dependencies | No | "All instruction-tuning training is performed on 4 × 8 NVIDIA H800 GPUs, using the training framework based on Hugging Face Transformers (Wolf et al., 2019)." Explanation: The paper mentions Hugging Face Transformers but does not provide specific version numbers for it or for any other key software components, such as Python or PyTorch, which would be necessary for reproduction.
Experiment Setup | Yes | "During model training, we optimize the loss only on the response content from assistant roles. We use a learning rate of 2e-5 with cosine decay and a batch size of 64 for all models. For Llama3-8B and Qwen2-7B, we train for five epochs, and for Llama3-70B and Qwen2-72B, we train for two epochs."
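The "loss only on the response content from assistant roles" setup quoted above is commonly implemented by masking non-assistant token positions out of the label sequence. The sketch below is illustrative only, not the authors' code: `build_labels`, the role spans, and the toy token IDs are hypothetical, while `-100` is the standard ignore index used by PyTorch's `CrossEntropyLoss` and Hugging Face trainers.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss
                     # (PyTorch CrossEntropyLoss ignore_index convention)

def build_labels(token_ids, role_spans):
    """Copy token_ids into labels, keeping loss only on assistant spans.

    role_spans: list of (role, start, end) tuples covering the tokenized
    conversation, with `end` exclusive. Every token outside an assistant
    span is masked with IGNORE_INDEX, so only assistant responses
    contribute to the training loss.
    """
    labels = [IGNORE_INDEX] * len(token_ids)
    for role, start, end in role_spans:
        if role == "assistant":
            labels[start:end] = token_ids[start:end]
    return labels

# Toy 10-token conversation: tokens 0-2 are the system prompt,
# 3-5 the user turn, and 6-9 the assistant response.
tokens = list(range(100, 110))
spans = [("system", 0, 3), ("user", 3, 6), ("assistant", 6, 10)]
labels = build_labels(tokens, spans)
# labels == [-100, -100, -100, -100, -100, -100, 106, 107, 108, 109]
```

In a multi-turn function-calling trajectory the same masking applies per turn, so tool-call results and user messages interleaved between assistant responses are excluded from the loss as well.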