Selective Forgetting: Advancing Machine Unlearning Techniques and Evaluation in Language Models
Authors: Lingzhi Wang, Xingshan Zeng, Jinsong Guo, Kam-Fai Wong, Georg Gottlob
AAAI 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper explores Machine Unlearning (MU), an emerging field that is gaining increased attention due to concerns about neural models unintentionally remembering personal or sensitive information. We present SEUL, a novel method that enables selective and fine-grained unlearning for language models. Furthermore, we introduce two innovative evaluation metrics, sensitive extraction likelihood (S-EL) and sensitive memorization accuracy (S-MA), specifically designed to assess the effectiveness of forgetting sensitive information. In support of the unlearning framework, we propose efficient automatic online and offline sensitive span annotation methods. The paper also includes sections titled "Experimental Setup" and "Experimental Results" discussing evaluations on datasets and comparisons to baselines. |
| Researcher Affiliation | Collaboration | Lingzhi Wang: Harbin Institute of Technology, Shenzhen, China; Xingshan Zeng: Huawei Noah's Ark Lab, China; Jinsong Guo: Unlimidata Ltd, United Kingdom; Kam-Fai Wong: The Chinese University of Hong Kong, China; Georg Gottlob: University of Calabria, Italy. The affiliations include both academic institutions (Harbin Institute of Technology, The Chinese University of Hong Kong, University of Calabria) and industry labs/companies (Huawei Noah's Ark Lab, Unlimidata Ltd). |
| Pseudocode | No | The paper describes methods like online selection and offline annotation in prose and provides mathematical formulations (e.g., Equation 1 and 2), but it does not include any explicitly labeled pseudocode blocks or algorithms with structured steps formatted like code. |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the methodology described is publicly available. Phrases like "We release our code..." or direct repository links are absent. |
| Open Datasets | Yes | The forget set is sourced from the Training Data Extraction Challenge (https://github.com/google-research/lm-extraction-benchmark). To evaluate general language modeling capabilities, we employ 8 classification tasks (Hellaswag (Zellers et al. 2019), Winogrande (Sakaguchi et al. 2021), COPA (Gordon, Kozareva, and Roemmele 2012), ARC-Easy (Clark et al. 2018), ARC-Challenge (Clark et al. 2018), PIQA (Bisk et al. 2020), MathQA (Amini et al. 2019), and PubMedQA (Jin et al. 2019)) and 4 dialogue tasks (Wizard of Wikipedia (Dinan et al. 2019), Empathetic Dialogues (Rashkin et al. 2019), Blended Skill Talk (Smith et al. 2020), and Wizard of Internet (Komeili, Shuster, and Weston 2022)). |
| Dataset Splits | No | The paper mentions using a "forget dataset Df" and a "test set Dt" for evaluation, and specifies the forget set comprises 15,000 examples. It lists various benchmark datasets used for classification and dialogue tasks. However, it does not provide explicit details on how these datasets were split into training, validation, and test sets, either by percentages, sample counts, or references to specific predefined splits used for their experiments beyond mentioning the datasets themselves. |
| Hardware Specification | Yes | All the models are trained with a single Nvidia GeForce RTX 3090. |
| Software Dependencies | No | The paper mentions the use of pre-trained language models like GPT-Neo series (125M, 1.3B, and 2.7B), Llama2-7B and Mistral-7B, but does not specify versions for any ancillary software libraries or frameworks (e.g., PyTorch, TensorFlow, Python version) used for the implementation. |
| Experiment Setup | Yes | The learning rate for training is set to 5 × 10⁻⁵, based on the selection from [2 × 10⁻⁵, 5 × 10⁻⁵, 1 × 10⁻⁴]. The variable denoting the number of forgetting instances, represented as d, is examined across the values d = 1, 2, 4, 8, 16, 32, 64, 128. Unless otherwise specified, the reported results in this paper are based on the d = 32 setting. We adapt the global batch size during training to be the same as d, the number of forgetting instances, following Jang et al. (2023). Each setting is run 5 times and the reported results are the average of 5 different runs. |
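The experiment-setup row above can be sketched as a small sweep harness. This is a minimal illustration, not the authors' code: `run_unlearning` is a hypothetical placeholder for one unlearning run, and only the stated hyperparameters (learning rate 5 × 10⁻⁵, d ∈ {1, …, 128}, global batch size equal to d, 5 runs averaged per setting) come from the paper.

```python
import statistics

LEARNING_RATE = 5e-5                       # selected from [2e-5, 5e-5, 1e-4]
D_VALUES = [1, 2, 4, 8, 16, 32, 64, 128]   # number of forgetting instances
NUM_RUNS = 5                               # results averaged over 5 runs

def run_unlearning(d, seed, lr=LEARNING_RATE):
    """Hypothetical stand-in for one unlearning run.

    The global batch size tracks d, following Jang et al. (2023).
    A real implementation would train here and return an evaluation metric.
    """
    batch_size = d  # global batch size set equal to d
    return 0.0      # dummy metric for illustration only

def averaged_results():
    # Average each d setting over NUM_RUNS seeded runs, as the paper reports.
    results = {}
    for d in D_VALUES:
        scores = [run_unlearning(d, seed) for seed in range(NUM_RUNS)]
        results[d] = statistics.mean(scores)
    return results
```

The paper's headline numbers correspond to the d = 32 entry of such a sweep.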