The Utility and Complexity of In- and Out-of-Distribution Machine Unlearning

Authors: Youssef Allouah, Joshua Kazdan, Rachid Guerraoui, Sanmi Koyejo

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Figure 1: Numerical validation on a linear regression task with synthetic data for the same unlearning budget, with in-distribution (left) and out-of-distribution (right) data. The in-distribution forget set is sampled at random, while the out-of-distribution data is obtained by shifting labels with a fixed offset. Additional details and results on real data can be found in Appendix F.
Researcher Affiliation | Academia | Youssef Allouah1, Joshua Kazdan2, Rachid Guerraoui1, Sanmi Koyejo2; 1EPFL, Switzerland; 2Stanford University, USA. EMAIL, EMAIL
Pseudocode | Yes | Algorithm 1: Unlearning via Noisy Minimizer Approximation; Algorithm 2: Unlearning via Robust Training and Noisy Minimizer Approximation
Open Source Code | No | The paper does not explicitly state that source code for its methodology is made available. It mentions using "the DP-SGD implementation of Opacus (Yousefpour et al., 2021)", which refers to a third-party library, not the authors' own code.
Open Datasets | Yes | We extend the empirical validation in Figure 1 in the out-of-distribution scenario to the California Housing dataset (Pace and Barry, 1997), a standard regression benchmark.
Dataset Splits | No | The paper describes how forget data is sampled from a larger set of training samples (e.g., "f = 20 forget data out of 10,000 samples", "f ∈ {1, 0.1n, 0.45n} forget data out of 1,000 samples"), and for the California Housing dataset it states "The dataset contains 20,640 samples". However, it does not provide specific train/test/validation splits (e.g., percentages or exact counts) for the datasets used to initially train the models, nor does it reference standard splits.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | The paper mentions using "the DP-SGD implementation of Opacus (Yousefpour et al., 2021)" but does not specify version numbers for Opacus or any other software components.
Experiment Setup | Yes | The full data features are generated from a d-dimensional Gaussian N(0, Id), d = 100, and the labels are generated from the features and a random true underlying model, also drawn from a Gaussian N(0, Id), with a Gaussian response. The in-distribution forget data consists of f = 20 data points sampled at random from the 10,000 full training samples. Moreover, we set the unlearning budget for the in-distribution scenario to ε = 1. For the out-of-distribution scenario, the forget data is obtained by shifting labels with a fixed offset set to 10^3, the total number of training samples being 1,000. Moreover, we set the unlearning budget for the out-of-distribution scenario to ε = 10. The learning rates for Algorithms 1 and 2 are set following standard theoretical convergence rates (Nesterov et al., 2018) for strongly convex tasks, and Theorem 3. We use the DP-SGD implementation of Opacus (Yousefpour et al., 2021), ... and run the optimizer until convergence or for 100 epochs (usually 10 more than Algorithm 1) with a fine-tuned learning rate.
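The synthetic setup described in the Experiment Setup row can be sketched in a few lines of NumPy. This is a minimal reconstruction under stated assumptions, not the paper's code: the variable names (theta_true, forget_idx) are illustrative, the seed is arbitrary, and the out-of-distribution shift is applied to a copy of the labels for clarity (the paper uses n = 1,000 samples in that scenario).

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

d, n = 100, 10_000
f = 20  # in-distribution forget-set size

# Features and a random "true" model, both drawn from N(0, I_d).
X = rng.standard_normal((n, d))
theta_true = rng.standard_normal(d)

# Gaussian response: labels are noisy linear measurements of theta_true.
y = X @ theta_true + rng.standard_normal(n)

# In-distribution forget set: f points sampled uniformly at random.
forget_idx = rng.choice(n, size=f, replace=False)

# Out-of-distribution forget set: shift the forget points' labels by a
# fixed offset of 10^3 (the paper does this with n = 1,000 samples).
offset = 1e3
y_ood = y.copy()
y_ood[forget_idx] += offset
```

Only the forget points' labels move; the rest of the data is untouched, which is what makes the shifted points out-of-distribution relative to the linear-Gaussian model.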
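The name of Algorithm 1, "Unlearning via Noisy Minimizer Approximation", suggests the standard output-perturbation idea: approximate the minimizer on the retained data, then add Gaussian noise to mask the forget set's residual influence. The sketch below is a generic instance of that idea for ridge regression, not a reproduction of the paper's algorithm; in particular, sigma is left as a free parameter rather than the (ε, δ)-calibrated noise scale the paper's analysis would prescribe.

```python
import numpy as np

def noisy_minimizer_unlearn(X_retain, y_retain, lam=1.0, sigma=0.1, rng=None):
    """Generic "noisy minimizer" sketch (not the paper's exact Algorithm 1).

    Solves the regularized least-squares problem on the retained data in
    closed form, then perturbs the minimizer with isotropic Gaussian noise.
    """
    if rng is None:
        rng = np.random.default_rng()
    d = X_retain.shape[1]
    # Exact ridge minimizer on the retain set.
    theta = np.linalg.solve(
        X_retain.T @ X_retain + lam * np.eye(d),
        X_retain.T @ y_retain,
    )
    # Gaussian output perturbation; in an (epsilon, delta)-unlearning
    # analysis, sigma would be calibrated to the unlearning budget.
    return theta + sigma * rng.standard_normal(d)
```

With sigma = 0 this reduces to plain retraining-from-scratch on the retain set, which is the exact-unlearning baseline that noisy approximations trade off against.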