Robust Function-Calling for On-Device Language Model via Function Masking
Authors: Qiqiang Lin, Muning Wen, Qiuying Peng, Guanyu Nie, Junwei Liao, Xiaoyun Mo, Jiamu Zhou, Cheng Cheng, Yin Zhao, Jun Wang, Weinan Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical evaluations reveal that Hammer not only outperforms larger models but also demonstrates robust generalization across diverse benchmarks, achieving state-of-the-art results. Our open-source contributions include a specialized dataset for irrelevance detection, a tuning framework for enhanced generalization, and the Hammer models, establishing a new standard for function-calling performance. In this section, we show the superiority of our Hammers in performance and robustness across various benchmarks as well as in-depth analysis to verify the effectiveness of our augmented dataset and approach. |
| Researcher Affiliation | Collaboration | ¹OPPO Research Institute, ²Shanghai Jiao Tong University, ³Iowa State University |
| Pseudocode | No | The paper does not contain any explicit sections or figures labeled 'Pseudocode' or 'Algorithm', nor does it present structured code-like steps for any method or procedure. Figures 1 and 3 are diagrams, and Appendix I provides input/output examples, not algorithmic pseudocode. |
| Open Source Code | Yes | The source code is available at https://github.com/MadeAgents/Hammer, while the augmented dataset and models can be accessed at https://huggingface.co/MadeAgents. The latest release, the Hammer 2.1 models, significantly improves multi-turn and multi-step function calling capabilities. |
| Open Datasets | Yes | Our open-source contributions include a specialized dataset for irrelevance detection... while the augmented dataset and models can be accessed at https://huggingface.co/MadeAgents. To assess the generalizability of Hammers, we conducted evaluations using a variety of function-calling benchmarks, all of which represent out-of-domain challenges for our model. The Berkeley Function-Calling Leaderboard (BFCL) (Yan et al., 2024)... API-Bank (Li et al., 2023)... Nexus Raven API Evaluation (Srinivasan et al., 2023)... Tool-Alpaca (Tang et al., 2023)... Seal-Tools (Wu et al., 2024)... |
| Dataset Splits | Yes | To assess the generalizability of Hammers, we conducted evaluations using a variety of function-calling benchmarks... For evaluation, we utilized 100 simulated test examples from this dataset [Tool-Alpaca]... The Berkeley Function-Calling Leaderboard (BFCL) (Yan et al., 2024) provides a comprehensive dataset comprising over 1,700 instances. We systematically applied different masking ratios while fine-tuning the Qwen2-1.5B model on the Seal-Tools training dataset for one epoch. Subsequently, we evaluated the performance of the models trained with different masking ratios on the test sets of both Seal-Tools and API-Bank. |
| Hardware Specification | Yes | Table 15 presents a detailed evaluation of non-functional metrics and hardware configurations for our Hammer-7B model, after it has been quantized using the Q4_K_M method... Processor: Snapdragon 8 Gen 3; Mobile Model: OPPO Find X7 Ultra |
| Software Dependencies | No | The paper mentions various models and, implicitly, software frameworks and languages (e.g., PyTorch, Python, Llama-3, Qwen) but does not provide specific version numbers for any software libraries, frameworks, or environments used in the experiments. The Q4_K_M quantization method is a technique, not a software dependency with a version. |
| Experiment Setup | Yes | Hammer enjoys a masking ratio of 0.33 before each training epoch, which yields the best overall performance across all benchmarks. To improve the model's ability to determine whether the user's intent aligns with the available function calls, we augment the xLAM-function-calling-60k dataset (Liu et al., 2024b) with an additional 7,500 instances specifically tailored for irrelevance detection. The Hammer model achieves optimal overall performance when the proportion of irrelevance-augmented data is approximately 10%. |
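The function masking referenced in the Experiment Setup row can be sketched as below: before each epoch, a fraction of training samples (the masking ratio, 0.33 in the paper) has its function names replaced with random tokens, so the model must rely on function descriptions rather than memorized names. The `mask_functions` helper, the `tools`/`label` sample schema, and the `func_` prefix are illustrative assumptions for this sketch, not the authors' released implementation.

```python
import random
import string

def mask_functions(samples, masking_ratio=0.33, rng=None):
    """Replace function names with meaningless tokens in a random
    fraction of samples (a sketch of per-epoch function masking)."""
    rng = rng or random.Random(0)
    masked = []
    for sample in samples:
        if rng.random() < masking_ratio:
            # Map each real function name to a random placeholder.
            mapping = {
                fn["name"]: "func_" + "".join(rng.choices(string.ascii_lowercase, k=8))
                for fn in sample["tools"]
            }
            tools = [{**fn, "name": mapping[fn["name"]]} for fn in sample["tools"]]
            # The target call must be renamed consistently with its tool entry.
            label = mapping.get(sample["label"], sample["label"])
            masked.append({"tools": tools, "label": label})
        else:
            masked.append(sample)
    return masked

samples = [{"tools": [{"name": "get_weather", "description": "Return current weather"}],
            "label": "get_weather"}]
# With ratio 1.0 every sample is masked; descriptions are untouched.
epoch_data = mask_functions(samples, masking_ratio=1.0)
print(epoch_data[0]["label"])  # a randomized "func_..." token
```

Re-sampling the mapping before every epoch (rather than once) is what prevents the model from simply memorizing the placeholder names instead.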