Hammer: Robust Function-Calling for On-Device Language Models via Function Masking

Authors: Qiqiang Lin, Muning Wen, Qiuying Peng, Guanyu Nie, Junwei Liao, Xiaoyun Mo, Jiamu Zhou, Cheng Cheng, Yin Zhao, Jun Wang, Weinan Zhang

ICLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | "Our empirical evaluations reveal that Hammer not only outperforms larger models but also demonstrates robust generalization across diverse benchmarks, achieving state-of-the-art results. Our open-source contributions include a specialized dataset for irrelevance detection, a tuning framework for enhanced generalization, and the Hammer models, establishing a new standard for function-calling performance. In this section, we show the superiority of our Hammers in performance and robustness across various benchmarks as well as in-depth analysis to verify the effectiveness of our augmented dataset and approach."
Researcher Affiliation | Collaboration | 1 OPPO Research Institute, 2 Shanghai Jiao Tong University, 3 Iowa State University
Pseudocode | No | The paper does not contain any explicit sections or figures labeled 'Pseudocode' or 'Algorithm', nor does it present structured code-like steps for any method or procedure. Figures 1 and 3 are diagrams, and Appendix I provides input/output examples, not algorithmic pseudocode.
Open Source Code | Yes | "The source code is available at https://github.com/MadeAgents/Hammer, while the augmented dataset and models can be accessed at https://huggingface.co/MadeAgents. The latest release, the Hammer 2.1 models, significantly improves multi-turn and multi-step function calling capabilities."
Open Datasets | Yes | "Our open-source contributions include a specialized dataset for irrelevance detection... while the augmented dataset and models can be accessed at https://huggingface.co/MadeAgents. To assess the generalizability of Hammers, we conducted evaluations using a variety of function-calling benchmarks, all of which represent out-of-domain challenges for our model. The Berkeley Function-Calling Leaderboard (BFCL) (Yan et al., 2024)... API-Bank (Li et al., 2023)... NexusRaven API Evaluation (Srinivasan et al., 2023)... Tool-Alpaca (Tang et al., 2023)... Seal-Tools (Wu et al., 2024)..."
Dataset Splits | Yes | "To assess the generalizability of Hammers, we conducted evaluations using a variety of function-calling benchmarks... For evaluation, we utilized 100 simulated test examples from this dataset [Tool-Alpaca]... The Berkeley Function-Calling Leaderboard (BFCL) (Yan et al., 2024) provides a comprehensive dataset comprising over 1,700 instances. We systematically applied different masking ratios while fine-tuning the Qwen2-1.5B model on the Seal-Tools training dataset for one epoch. Subsequently, we evaluated the performance of the models trained with different masking ratios on the test sets of both Seal-Tools and API-Bank."
Hardware Specification | Yes | "Table 15 presents a detailed evaluation of non-functional metrics and hardware configurations for our Hammer-7B model, after it has been quantized using the Q4_K_M method... Processor: Snapdragon 8 Gen 3; Mobile Model: OPPO Find X7 Ultra"
Software Dependencies | No | The paper mentions various models, techniques, and implicitly programming languages (e.g., PyTorch, Python, Llama-3, Qwen) but does not provide specific version numbers for any software libraries, frameworks, or environments used in the experiments. The Q4_K_M quantization method is a technique, not a software dependency with a version.
Experiment Setup | Yes | "Hammer enjoys a masking ratio of 0.33 before each training epoch, which yields the best overall performance across all benchmarks. To improve the model's ability to determine whether the user's intent aligns with the available function calls, we augment the xLAM-function-calling-60k dataset (Liu et al., 2024b) with an additional 7,500 instances specifically tailored for irrelevance detection. The Hammer model achieves optimal overall performance when the proportion of irrelevance-augmented data is approximately 10%."
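The function-masking idea behind the setup above can be sketched in a few lines: before each training epoch, a fraction of function names (the paper's best ratio is 0.33) is replaced with opaque aliases, so the model must rely on function descriptions rather than memorized identifiers. This is a minimal illustrative sketch, not the paper's implementation; the `mask_functions` helper, the dict structure of `tools`, and the alias format are all assumptions.

```python
import random
import string

def mask_functions(tools, masking_ratio=0.33, rng=random):
    """Randomly replace function names with opaque aliases.

    Hypothetical sketch of function masking: each tool is a dict with
    'name' and 'description' keys (an assumed schema). With probability
    `masking_ratio`, the name is swapped for a random alias; a mapping
    from alias back to the original name is returned so calls predicted
    by the model can be resolved afterwards.
    """
    masked, mapping = [], {}
    for tool in tools:
        tool = dict(tool)  # avoid mutating the caller's data
        if rng.random() < masking_ratio:
            alias = "func_" + "".join(rng.choices(string.ascii_lowercase, k=8))
            mapping[alias] = tool["name"]
            tool["name"] = alias
        masked.append(tool)
    return masked, mapping
```

Re-sampling the masks before every epoch (as the quoted setup describes) would simply mean calling `mask_functions` on the tool schemas at the start of each epoch with a freshly seeded `rng`.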