Tool Unlearning for Tool-Augmented LLMs

Authors: Jiali Cheng, Hadi Amiri

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on multiple tool learning datasets and tool-augmented LLMs show that TOOLDELETE effectively unlearns both randomly selected and class-specific tools, while preserving knowledge of remaining tools and maintaining performance on general tasks.
Researcher Affiliation | Academia | University of Massachusetts Lowell, USA. Correspondence to: Jiali Cheng <jiali EMAIL>, Hadi Amiri <hadi EMAIL>.
Pseudocode | No | The paper describes the TOOLDELETE framework with mathematical formulations and detailed textual explanations of its properties and training details. However, it does not include a clearly labeled pseudocode or algorithm block.
Open Source Code | No | The paper refers to public checkpoints of tool-augmented LLMs on Hugging Face (TangQiaoYu/ToolAlpaca-7B, ToolBench/ToolLLaMA-2-7b-v2, gorilla-llm/gorilla-openfunctions-v0) as starting points for unlearning. However, it does not provide an explicit statement or link for the source code of the proposed TOOLDELETE method itself.
Open Datasets | Yes | We experiment with the following datasets and their corresponding LLMs: ToolAlpaca (Tang et al., 2023) is an agent-generated tool learning dataset consisting of 495 tools and 3,975 training examples. [...] ToolBench (Qin et al., 2024) consists of more than 16k real-world APIs from 49 categories [...] APIBench (Patil et al., 2023) focuses on APIs that load machine learning models.
Dataset Splits | Yes | Then we conduct unlearning experiments with 2–20% of tools randomly selected as Tf.
Hardware Specification | Yes | All experiments are conducted on 8 NVIDIA A100 GPUs.
Software Dependencies | No | The paper mentions specific models such as Vicuna-v1.3, LLaMA-2 7B, and LLaMA 7B, and references the Python transformers package in an example. However, it does not list software dependencies with version numbers required to replicate the experimental setup.
Experiment Setup | Yes | We use a learning rate of 10^-5 across all experiments.
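To make the "Dataset Splits" entry concrete, the sketch below shows one way a forget set could be drawn: randomly selecting a fraction (2–20%, per the quoted setup) of tools as Tf, with the rest retained. This is an illustrative sketch only, not the paper's code; the function name `split_forget_set`, the retain-set name `T_r`, and the fixed seed are assumptions, while the 495-tool count comes from the ToolAlpaca description above.

```python
import random

def split_forget_set(tools, forget_frac, seed=0):
    """Randomly pick a fraction of tools as the forget set Tf;
    the remainder form the retain set. Illustrative only."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    k = max(1, round(len(tools) * forget_frac))
    forget = set(rng.sample(tools, k))
    retain = [t for t in tools if t not in forget]
    return sorted(forget), retain

# ToolAlpaca is described as having 495 tools; 20% is the upper
# end of the quoted 2-20% forget-set range.
tools = [f"tool_{i}" for i in range(495)]
T_f, T_r = split_forget_set(tools, forget_frac=0.20)
```

With a 20% fraction this yields 99 forget tools and 396 retained tools, and the two sets are disjoint by construction.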