Reducing Tool Hallucination via Reliability Alignment

Authors: Hongshen Xu, Zichen Zhu, Lei Pan, Zihan Wang, Su Zhu, Da Ma, Ruisheng Cao, Lu Chen, Kai Yu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments, we demonstrate that Relign significantly reduces tool hallucinations, improves task reliability, and enhances the efficiency of LLM tool interactions. The code and data are publicly available at https://github.com/X-LANCE/ToolHallucination. As shown in Table 2, we assessed the performance of our model on three subsets of StableToolBench. Our primary focus was on measuring the tool hallucination rate and the task pass rate.
Researcher Affiliation | Collaboration | X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, Shanghai, China. AISpeech Co., Ltd., Suzhou, China.
Pseudocode | No | The paper describes the Relign framework and its components (tool-level alignment objectives, data synthesis, training pipeline) in prose within Section 3, 'Reliability Alignment', without presenting a formal pseudocode or algorithm block.
Open Source Code | Yes | The code and data are publicly available at https://github.com/X-LANCE/ToolHallucination.
Open Datasets | Yes | Datasets. In our experiments, we primarily utilized three datasets: ToolBench, StableToolBench, and RelyToolBench. ToolBench (Qin et al., 2023b) was constructed by scraping various APIs from RapidAPI and generating corresponding tasks and executions, comprising a total of 120,000 data samples. StableToolBench (Guo et al., 2024) is a selected subset of solvable samples from ToolBench, and it also proposed a stable environment for evaluation. We randomly select 10,000 samples from ToolBench for constructing reliability alignment data. We further construct RelyToolBench based on StableToolBench for evaluation.
Dataset Splits | Yes | To construct the SFT training data, we randomly split 10,000 samples into three subsets: 4,000 remain unchanged, 3,000 are modified by replacing tool choices, and 3,000 are used to construct missing parameter cases.
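The 4,000/3,000/3,000 split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the sample placeholders and subset names are assumptions; only the split sizes come from the paper.

```python
import random

# Hypothetical sketch of the SFT data split (4,000 / 3,000 / 3,000).
# Integers stand in for the 10,000 ToolBench samples; the real pipeline
# would carry full tool-call trajectories instead.
random.seed(0)
samples = list(range(10_000))
random.shuffle(samples)

unchanged = samples[:4_000]           # kept as-is
tool_replaced = samples[4_000:7_000]  # tool choices replaced
missing_param = samples[7_000:]       # missing-parameter cases constructed

# The three subsets partition the original 10,000 samples exactly.
assert len(unchanged) == 4_000
assert len(tool_replaced) == 3_000
assert len(missing_param) == 3_000
assert set(unchanged) | set(tool_replaced) | set(missing_param) == set(samples)
```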
Hardware Specification | Yes | We conduct all experiments using NVIDIA A800 GPUs.
Software Dependencies | Yes | We use LLAMA-3.1-8B-INSTRUCT, QWEN-2.5-7B-INSTRUCT, and TOOLLLAMA-7B (Qin et al., 2023b) as our experimental models. ... We utilize the DeepSpeed-Chat framework for efficient model training.
Experiment Setup | Yes | For all experiments, we set the training batch size to 32 and the maximum sequence length to 8192. We utilize the DeepSpeed-Chat framework for efficient model training. In all methods, the learning rate is set to 1e-5 for both SFT and DPO to ensure consistency, with all training conducted over two epochs.
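The reported setup can be collected into a single configuration sketch. The dictionary keys below mirror common DeepSpeed-Chat-style argument names and are assumptions; only the values (batch size 32, sequence length 8192, learning rate 1e-5, two epochs) come from the paper.

```python
# Hypothetical configuration mirroring the reported Relign training setup.
# Key names are illustrative, not the authors' actual flags.
TRAIN_ARGS = {
    "per_device_train_batch_size": 32,  # training batch size
    "max_seq_len": 8192,                # maximum sequence length
    "learning_rate": 1e-5,              # identical for SFT and DPO
    "num_train_epochs": 2,              # all training runs two epochs
}

# Upper bound on tokens per optimizer step, for a rough sense of memory/cost:
tokens_per_step = TRAIN_ARGS["per_device_train_batch_size"] * TRAIN_ARGS["max_seq_len"]
print(tokens_per_step)  # 262144
```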