Towards Effective Evaluations and Comparisons for LLM Unlearning Methods
Authors: Qizhou Wang, Bo Han, Puning Yang, Jianing Zhu, Tongliang Liu, Masashi Sugiyama
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation framework notably enhances the effectiveness when assessing and comparing various LLM unlearning methods, further allowing us to benchmark existing works, identify their proper hyper-parameters, and explore new tricks to enhance their practical efficacy. The code is publicly available at: https://github.com/tmlr-group/Unlearning-with-Control. ... 6 EXPERIMENTS We benchmark existing LLM unlearning methods using UWC, recommending their proper hyperparameters, assessing and comparing their efficacy in achieving effective unlearning. ... We report not only the ES scores for original data but also for the associated paraphrased versions provided by TOFU. |
| Researcher Affiliation | Academia | 1 TMLR Group, Department of Computer Science, Hong Kong Baptist University; 2 RIKEN Center for Advanced Intelligence Project; 3 Sydney AI Center, The University of Sydney; 4 The University of Tokyo |
| Pseudocode | Yes | Algorithm 1 Binary Search for MM Calibration |
| Open Source Code | Yes | The code is publicly available at: https://github.com/tmlr-group/Unlearning-with-Control. |
| Open Datasets | Yes | Our evaluations were based on the well-established benchmarks of TOFU fictitious unlearning (Maini et al., 2024), focusing on LLMs fine-tuned with a series of fictitious authors profiles. |
| Dataset Splits | Yes | For the unlearning setups, the original TOFU data were separated into targeted and non-targeted parts, of which the adopted proportions are 1:99 (1% unlearning), 5:95 (5% unlearning), and 10:90 (10% unlearning). |
| Hardware Specification | Yes | All our experiments were realized by Transformers 4.42.4 with CUDA 12.1, using a series of computation nodes equipped with NVIDIA-A100-80GB GPUs and Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz Processors. |
| Software Dependencies | Yes | All our experiments were realized by Transformers 4.42.4 with CUDA 12.1, using a series of computation nodes equipped with NVIDIA-A100-80GB GPUs and Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz Processors. |
| Experiment Setup | Yes | For all the considered methods, we adopt the following implementation setups: the AdamW optimizer (Loshchilov & Hutter, 2017), the initial learning rate 2e-5 for Phi-1.5 and 1e-5 for Llama-2-7B, the batch size 16 for both the targeted and non-targeted data, the epoch number 5, and the linear warm-up for the first epoch. For MM calibration, we set τ = 0.95 for Phi-1.5 and τ = 0.90 for Llama-2-7B. |
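The Dataset Splits row describes separating TOFU into targeted (forget) and non-targeted (retain) parts at 1:99, 5:95, and 10:90 ratios. TOFU ships its own predefined splits, so the sketch below is only an illustration of the splitting logic; the function name `split_forget_retain` and the seeded-shuffle approach are assumptions, not the paper's implementation.

```python
import random

def split_forget_retain(records, forget_frac, seed=0):
    """Split a dataset into a targeted (forget) part and a non-targeted
    (retain) part, e.g. forget_frac=0.01 for the 1% unlearning setup."""
    rng = random.Random(seed)
    idx = list(range(len(records)))
    rng.shuffle(idx)  # deterministic shuffle so the split is reproducible
    cut = int(len(records) * forget_frac)
    forget = [records[i] for i in idx[:cut]]
    retain = [records[i] for i in idx[cut:]]
    return forget, retain
```

With `forget_frac` set to 0.01, 0.05, or 0.10, this reproduces the 1%, 5%, and 10% unlearning proportions quoted above.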
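The Pseudocode row names "Algorithm 1 Binary Search for MM Calibration", and the Experiment Setup row gives its threshold τ (0.95 for Phi-1.5, 0.90 for Llama-2-7B). The paper's algorithm is not reproduced in the extracted evidence, so the following is a generic binary-search sketch under an assumed interface: a `utility` callable that is monotonically non-increasing in the calibration factor, with the search returning the largest factor that still retains at least τ of the base utility. All names here are hypothetical.

```python
def binary_search_calibration(utility, tau, base_utility,
                              lo=0.0, hi=1.0, tol=1e-3):
    """Find (to within tol) the largest factor alpha in [lo, hi] such that
    utility(alpha) >= tau * base_utility, assuming utility is
    monotonically non-increasing in alpha."""
    target = tau * base_utility
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if utility(mid) >= target:
            lo = mid  # mid still retains enough utility; search higher
        else:
            hi = mid  # utility fell below the threshold; search lower
    return lo
```

Because the invariant `utility(lo) >= target` is maintained throughout, the returned value always satisfies the τ constraint.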
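The Experiment Setup row specifies a linear warm-up over the first epoch from an initial learning rate of 2e-5 (Phi-1.5) or 1e-5 (Llama-2-7B). A minimal sketch of such a schedule, assuming the rate ramps linearly to `base_lr` over `warmup_steps` and holds flat afterwards (the quoted setup does not specify any post-warm-up decay):

```python
def warmup_lr(step, base_lr, warmup_steps):
    """Learning rate at a given step: linear ramp during warm-up,
    then constant at base_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```

For example, with `base_lr=2e-5` and one epoch of 100 steps, the rate grows from 2e-7 at step 0 to 2e-5 at step 99 and stays there for the remaining four epochs.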