MUSE: Machine Unlearning Six-Way Evaluation for Language Models
Authors: Weijia Shi, Jaechan Lee, Yangsibo Huang, Sadhika Malladi, Jieyu Zhao, Ari Holtzman, Daogao Liu, Luke Zettlemoyer, Noah Smith, Chiyuan Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We address this issue by proposing MUSE, a comprehensive machine unlearning evaluation benchmark that enumerates six diverse desirable properties for unlearned models... Using these criteria, we benchmark how effectively eight popular unlearning algorithms on 7B-parameter LMs can unlearn Harry Potter books and news articles. Our results demonstrate that most algorithms can prevent verbatim memorization and knowledge memorization to varying degrees, but only one algorithm does not lead to severe privacy leakage. |
| Researcher Affiliation | Collaboration | 1University of Washington 2Princeton University 3University of Southern California 4University of Chicago 5Google Research |
| Pseudocode | No | The paper describes unlearning methods like Gradient Ascent and Negative Preference Optimization (NPO) and provides the mathematical objective for NPO, but it does not present these or any other procedures in a structured pseudocode block labeled 'Pseudocode' or 'Algorithm'. |
| Open Source Code | No | Code We will provide the code for all baseline methods, evaluation scripts used for benchmarking, as well as the code for visualizations and analysis presented in this paper. |
| Open Datasets | Yes | NEWS consists of BBC news articles (Li et al., 2023b) collected after August 2023. All articles are randomly divided into (disjoint) forget, retain, and holdout sets. BOOKS consists of the Harry Potter book series. To simulate a real-world setting for testing utility preservation (C4), we include different types of materials in the forget and retain sets. The forget set contains the original books, while the retain set contains related content from the Harry Potter Fan Wiki, harrypotter.fandom.com/wiki |
| Dataset Splits | Yes | All articles are randomly divided into (disjoint) forget, retain, and holdout sets. The sizes of the forget and retain sets are reported in tokens in parentheses. Note that only the Verbatim texts within the Forget Set are included in our training data, while all Knowledge sets (QA pairs) serve for evaluations. ... To simulate sequential unlearning, we partition the extended NEWS forget set (comprised of 3.3M tokens) into four disjoint folds (each containing 0.8M tokens) and apply the unlearning methods to each fold in a sequential manner. |
| Hardware Specification | Yes | All experiments are conducted on 8 NVIDIA A40 GPU cards in a single node. |
| Software Dependencies | No | The paper mentions specific models like 'LLaMA-2 7B' and 'ICLM-7B' and an optimizer 'AdamW optimizer' but does not specify version numbers for any programming languages, libraries, or frameworks used (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | Following prior work (Maini et al., 2024), we run GA, NPO, and their regularized variants using the AdamW optimizer (Loshchilov & Hutter, 2017) with a constant learning rate of 1e-5 and a batch size of 32. We employ the stopping criteria as follows: if the utility (i.e., KnowMem on D_retain) of a model undergoing unlearning drops below that of f_retrain within 10 epochs of unlearning, we stop at the first epoch where this condition holds; otherwise, we take a checkpoint from the 10th epoch. For Task Vector and WHP, to obtain the reinforced model for unlearning, we fine-tune the target model for 10 epochs using the same learning rate and batch size. |
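The stopping criterion quoted in the Experiment Setup row can be sketched as a short loop. This is a hypothetical illustration, not the paper's released code: `step_fn` (one unlearning epoch, e.g. a GA or NPO pass), `utility_fn` (KnowMem on D_retain), and `retrain_utility` (the retrained model's utility) are assumed callables/values standing in for the paper's actual components.

```python
def unlearn_with_stopping(step_fn, utility_fn, retrain_utility, max_epochs=10):
    """Hedged sketch of the paper's stopping rule: run unlearning epoch by
    epoch; if utility (KnowMem on D_retain) drops below that of the retrained
    model within max_epochs, stop at the first epoch where this holds;
    otherwise keep the checkpoint from the final (10th) epoch."""
    for epoch in range(1, max_epochs + 1):
        step_fn()                        # one unlearning epoch (e.g. GA/NPO, lr=1e-5, batch=32)
        if utility_fn() < retrain_utility:
            return epoch                 # early stop: utility fell below f_retrain
    return max_epochs                    # no drop observed: take the 10th-epoch checkpoint
```

The early-stop-on-utility design bounds over-unlearning: without it, additional epochs would keep degrading utility on the retain set well past the retrained-model baseline.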