Offset Unlearning for Large Language Models

Authors: James Y. Huang, Wenxuan Zhou, Fei Wang, Fred Morstatter, Sheng Zhang, Hoifung Poon, Muhao Chen

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate that δ-Unlearning can effectively unlearn target data while maintaining similar or even stronger performance on general out-of-forget-scope tasks. The method is evaluated on TOFU (Maini et al., 2024), a widely used LLM unlearning benchmark containing knowledge about fictitious authors; the experimental results on TOFU are shown in Tab. 2.
Researcher Affiliation | Collaboration | James Y. Huang (University of Southern California), Sheng Zhang (Microsoft Research), Muhao Chen (University of California, Davis)
Pseudocode | No | No pseudocode or algorithm block is present. The methodology is described in prose and mathematical formulas.
Open Source Code | No | The paper does not contain an explicit statement about releasing source code, nor a link to a code repository for the described methodology.
Open Datasets | Yes | Experiments are conducted on TOFU (Maini et al., 2024), a widely used benchmark for evaluating LLM unlearning. In addition to TOFU, the unlearned model is assessed for preservation of general utility on well-established benchmarks, including ARC (Clark et al., 2018), HellaSwag (Zellers et al., 2019), WinoGrande (Sakaguchi et al., 2021), and OpenBookQA (Mihaylov et al., 2018).
Dataset Splits | No | The paper defines evaluation sets from the TOFU benchmark (Forget Set, Retain Set, Real Author, World Fact) but does not provide explicit training/validation/test splits, their percentages, or the methodology used to create them.
Hardware Specification | Yes | All models are trained using NVIDIA A100 GPUs for 5 epochs with a batch size of 32.
Software Dependencies | No | The paper mentions specific Llama2 models (Llama2-13b-chat-hf and Llama2-7b-chat-hf) but does not provide version numbers for the programming languages or libraries used in the implementation.
Experiment Setup | Yes | All models are trained using NVIDIA A100 GPUs for 5 epochs with a batch size of 32. α is set to 1 in the experiments. Following Yao et al. (2024), all models are matched to the target ROUGE score by adjusting the learning rate.
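The α = 1 setting above refers to how δ-Unlearning steers the larger model: the logit difference (δ) between a small unlearned model and its unmodified counterpart is added, scaled by α, to the large model's logits. A minimal NumPy sketch of that composition, assuming the offset formulation `logits_large + α · (logits_small_unlearned − logits_small_base)`; the function name and all array values are illustrative, not from the paper:

```python
import numpy as np

def delta_unlearning_logits(logits_large, logits_small_base,
                            logits_small_unlearned, alpha=1.0):
    """Offset the large model's logits with the delta contributed by a
    small model pair (unlearned minus base). alpha=1 matches the
    experiment setup reported above."""
    delta = logits_small_unlearned - logits_small_base
    return logits_large + alpha * delta

# Toy example over a 4-token vocabulary.
large = np.array([2.0, 1.0, 0.5, 0.0])          # large model's next-token logits
small_base = np.array([1.5, 0.8, 0.4, 0.1])     # small model before unlearning
small_unl = np.array([0.5, 0.8, 0.4, 0.1])      # small model suppresses token 0 after unlearning

adjusted = delta_unlearning_logits(large, small_base, small_unl, alpha=1.0)
# Token 0's logit drops by the small models' delta (1.0), steering the
# large model away from the forgotten answer without modifying its weights.
```

Since only logits are combined, the large model can remain a black box: unlearning updates touch the small model pair alone.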