On Large Language Model Continual Unlearning
Authors: Chongyang Gao, Lixu Wang, Kaize Ding, Chenkai Weng, Xiao Wang, Qi Zhu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted extensive experiments on O3 and state-of-the-art LLM unlearning methods across three tasks and seven datasets. The results indicate that O3 consistently achieves the best unlearning effectiveness and utility preservation, especially when facing continuous unlearning requests. |
| Researcher Affiliation | Academia | 1Northwestern University, 2Arizona State University |
| Pseudocode | Yes | The detailed pipeline of unlearning knowledge detection is shown in Algorithm 1. At a high level, the module fine-tunes an out-of-distribution (OOD) detector backbone model on the data of the t-th unlearning request with the contrastive entropy loss. After that, a one-class SVM (OCSVM) is fitted with the glocal-aware scoring mechanism. The OOD detector backbone and the fitted OCSVM are used to assess the similarity between the input and the unlearning data, which allows the O3 framework to decide whether and to what extent to load the unlearning LoRA in the inference phase. In addition, the soft-weighted inference of the O3 framework is shown in Algorithm 2: it leverages the OOD module to assess the similarity between the input and previously seen unlearning data, then decides whether and to what extent to load the unlearning LoRA. |
| Open Source Code | Yes | The source codes can be found at https://github.com/GCYZSL/O3-LLM-UNLEARNING. |
| Open Datasets | Yes | Question Answering. For Science QA (Lu et al., 2022b), we gather text-only samples to form a train and test set with 6,508 and 2,224 samples. We choose four domains in Science QA as continual unlearning requests, i.e., biology, physics, chemistry, and economics. We use Commonsense QA (Talmor et al., 2019) as a utility dataset, which contains 9,740 training samples and 1,221 validation samples for evaluating the commonsense reasoning capability of LLMs. Openbook QA (Mihaylov et al., 2018) can assess the book comprehension ability, consisting of 4,957 training, 500 validation, and 500 testing samples. Fictitious Knowledge Generation. TOFU (Maini et al., 2024) consists of questions about fake authors synthesized by GPT-4. Intent Classification. CLINC150 (mis, 2020) is designed for intent classification, which comprises 150 classes across five domains, and each includes 200 training, 40 validation, and 60 testing samples. For the unlearning settings, we choose the three domains most related to privacy ("work", "travel", and "home") as the continual unlearning requests. To evaluate the utility preservation, we leverage MRPC (Dolan & Brockett, 2005) and RTE (Wang et al.) on the tasks of paraphrase identification and textual entailment, respectively. |
| Dataset Splits | Yes | Question Answering. For Science QA (Lu et al., 2022b), we gather text-only samples to form a train and test set with 6,508 and 2,224 samples. [...] We use Commonsense QA (Talmor et al., 2019) as a utility dataset, which contains 9,740 training samples and 1,221 validation samples for evaluating the commonsense reasoning capability of LLMs. Openbook QA (Mihaylov et al., 2018) can assess the book comprehension ability, consisting of 4,957 training, 500 validation, and 500 testing samples. [...] Intent Classification. CLINC150 (mis, 2020) is designed for intent classification, which comprises 150 classes across five domains, and each includes 200 training, 40 validation, and 60 testing samples. [...] When dealing with the continually arriving unlearning requests, we first randomly divide the unlearning dataset D_{U,t} into two subsets D_{U,t}^{used} and D_{U,t}^{rest} with αN_{U,t} and (1−α)N_{U,t} samples (α = 80% in our implementation), respectively. |
| Hardware Specification | No | The paper mentions models like LLaMA2-7b and RoBERTa-large, but does not specify the actual hardware (GPU/CPU models, memory, etc.) used for experiments. |
| Software Dependencies | No | The paper mentions using specific models like LLaMA2-7b and RoBERTa-large, and the AdamW optimizer. The GitHub link implies a Python implementation, but the paper does not specify version numbers for these software components or for any other libraries (e.g., PyTorch, TensorFlow). |
| Experiment Setup | Yes | We use the AdamW optimizer with 3e-4 as the learning rate and 128 as the batch size for combined datasets. The epochs are 10 and 20 for Science QA-Commonsense QA-Openbook QA and CLINC150-MRPC-RTE, respectively. We set the LoRA rank for all experiments to 8 and the alpha to 16. |
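The soft-weighted inference described in the Pseudocode row (Algorithm 2) can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes some embedding function maps inputs to feature vectors, fits scikit-learn's `OneClassSVM` on embeddings of the unlearning data, and maps the OCSVM decision score through a sigmoid to a [0, 1] loading weight for the unlearning LoRA. The random embeddings, the `nu` value, and the sigmoid weighting curve are all illustrative assumptions.

```python
# Sketch of OOD-gated LoRA loading: fit a one-class SVM on embeddings of the
# unlearning data, then soft-weight the unlearning LoRA by the decision score.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Stand-in for embeddings produced by the fine-tuned OOD detector backbone.
unlearn_emb = rng.normal(loc=0.0, scale=1.0, size=(200, 16))
ocsvm = OneClassSVM(nu=0.1, kernel="rbf", gamma="scale").fit(unlearn_emb)

def lora_weight(x: np.ndarray, scale: float = 1.0) -> float:
    """Map the OCSVM decision score to a [0, 1] soft weight for the LoRA.

    Higher score means the input looks like seen unlearning data, so the
    unlearning LoRA should be loaded more strongly.
    """
    score = ocsvm.decision_function(x.reshape(1, -1))[0]
    return float(1.0 / (1.0 + np.exp(-scale * score)))  # assumed weighting curve

in_dist = unlearn_emb[0]        # resembles the unlearning data
far_away = np.full(16, 8.0)     # clearly out-of-distribution
```

An input resembling the unlearning data receives a larger weight than a clearly out-of-distribution input, so unrelated queries pass through the base model almost unaffected.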
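The hyperparameters in the Experiment Setup row can be collected into a configuration fragment. The paper does not state which training libraries were used, so the use of Hugging Face `peft` here is an assumption; only the numeric values (rank 8, alpha 16, learning rate 3e-4, batch size 128, 10/20 epochs) come from the source.

```python
# Hedged reconstruction of the reported fine-tuning configuration.
# `peft` is an assumed library choice; the hyperparameter values are the
# ones reported in the Experiment Setup row.
from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,             # LoRA rank used for all experiments
    lora_alpha=16,   # LoRA alpha
)

training_args = {
    "optimizer": "adamw",          # AdamW optimizer
    "learning_rate": 3e-4,
    "batch_size": 128,             # for combined datasets
    "num_epochs": 10,              # 10 for ScienceQA-CommonsenseQA-OpenbookQA,
                                   # 20 for CLINC150-MRPC-RTE
}
```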