Reducing Tool Hallucination via Reliability Alignment

Authors: Hongshen Xu, Zichen Zhu, Lei Pan, Zihan Wang, Su Zhu, Da Ma, Ruisheng Cao, Lu Chen, Kai Yu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments, we demonstrate that Relign significantly reduces tool hallucinations, improves task reliability, and enhances the efficiency of LLM tool interactions. The code and data are publicly available at https://github.com/X-LANCE/ToolHallucination. As shown in Table 2, we assessed the performance of our model on three subsets of StableToolBench. Our primary focus was on measuring the tool hallucination rate and the task pass rate.
Researcher Affiliation | Collaboration | X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, Shanghai, China. AISpeech Co., Ltd., Suzhou, China.
Pseudocode | No | The paper describes the Relign framework and its components (tool-level alignment objectives, data synthesis, training pipeline) in prose within Section 3, 'Reliability Alignment', without presenting a formal pseudocode or algorithm block.
Open Source Code | Yes | The code and data are publicly available at https://github.com/X-LANCE/ToolHallucination.
Open Datasets | Yes | Datasets. In our experiments, we primarily utilized three datasets: ToolBench, StableToolBench, and RelyToolBench. ToolBench (Qin et al., 2023b) was constructed by scraping various APIs from RapidAPI and generating corresponding tasks and executions, comprising a total of 120,000 data samples. StableToolBench (Guo et al., 2024) is a selected subset of solvable samples from ToolBench, and it also proposed a stable environment for evaluation. We randomly select 10,000 samples from ToolBench for constructing reliability alignment data. We further construct RelyToolBench based on StableToolBench for evaluation.
Dataset Splits | Yes | To construct the SFT training data, we randomly split 10,000 samples into three subsets: 4,000 remain unchanged, 3,000 are modified by replacing tool choices, and 3,000 are used to construct missing parameter cases.
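The 4,000/3,000/3,000 split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the sample placeholders and subset names are assumptions; only the split sizes come from the paper.

```python
import random

# Hypothetical sketch of the SFT data split (4,000 / 3,000 / 3,000).
# Integers stand in for the 10,000 ToolBench samples; the real pipeline
# would carry full tool-call trajectories instead.
random.seed(0)
samples = list(range(10_000))
random.shuffle(samples)

unchanged = samples[:4_000]           # kept as-is
tool_replaced = samples[4_000:7_000]  # tool choices replaced
missing_param = samples[7_000:]       # missing-parameter cases constructed

# The three subsets partition the original 10,000 samples exactly.
assert len(unchanged) == 4_000
assert len(tool_replaced) == 3_000
assert len(missing_param) == 3_000
assert set(unchanged) | set(tool_replaced) | set(missing_param) == set(samples)
```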
Hardware Specification | Yes | We conduct all experiments using NVIDIA A800 GPUs.
Software Dependencies | Yes | We use LLAMA-3.1-8B-INSTRUCT, QWEN-2.5-7B-INSTRUCT, and TOOLLLAMA-7B (Qin et al., 2023b) as our experimental models. ... We utilize the DeepSpeed-Chat framework for efficient model training.
Experiment Setup | Yes | For all experiments, we set the training batch size to 32 and the maximum sequence length to 8192. We utilize the DeepSpeed-Chat framework for efficient model training. In all methods, the learning rate is set to 1e-5 for both SFT and DPO to ensure consistency, with all training conducted over two epochs.
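The reported setup can be collected into a single configuration sketch. The dictionary keys below mirror common DeepSpeed-Chat-style argument names and are assumptions; only the values (batch size 32, sequence length 8192, learning rate 1e-5, two epochs) come from the paper.

```python
# Hypothetical configuration mirroring the reported Relign training setup.
# Key names are illustrative, not the authors' actual flags.
TRAIN_ARGS = {
    "per_device_train_batch_size": 32,  # training batch size
    "max_seq_len": 8192,                # maximum sequence length
    "learning_rate": 1e-5,              # identical for SFT and DPO
    "num_train_epochs": 2,              # all training runs two epochs
}

# Upper bound on tokens per optimizer step, for a rough sense of memory/cost:
tokens_per_step = TRAIN_ARGS["per_device_train_batch_size"] * TRAIN_ARGS["max_seq_len"]
print(tokens_per_step)  # 262144
```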