Hessian-Free Online Certified Unlearning
Authors: Xinbao Qiao, Meng Zhang, Ming Tang, Ermin Wei
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate that our proposed scheme surpasses existing results by orders of magnitude in terms of time/storage costs with millisecond-level unlearning execution, while also enhancing test accuracy. We conduct experimental evaluations using a wider range of metrics compared to previous theoretical second-order studies, and release our open source code. The experimental results verify our theoretical analysis and demonstrate that our proposed approach surpasses previous certified unlearning works. In particular, our algorithm incurs millisecond-level unlearning runtime to forget per sample with minimal performance degradation. |
| Researcher Affiliation | Academia | 1 Zhejiang University, 2 Southern University of Science and Technology, 3 Northwestern University |
| Pseudocode | Yes | Algorithm 1: Hessian-Free Online Unlearning (HF) Algorithm |
| Open Source Code | Yes | We conduct experimental evaluations using a wider range of metrics compared to previous theoretical second-order studies, and release our open source code. The experimental results verify our theoretical analysis and demonstrate that our proposed approach surpasses previous certified unlearning works. |
| Open Datasets | Yes | We conduct experiments in both convex and non-convex scenarios. Specifically, we trained a multinomial Logistic Regression (LR) with total parameters d = 7850 and a simple convolutional neural network (CNN) with total parameters d = 21840 on the MNIST dataset (Deng (2012)) for handwriting digit classification. We further evaluate using the larger-scale model ResNet-18 (He et al. (2016)), which features 11M parameters, with three datasets: CIFAR-10 (Alex (2009)) for image classification, CelebA (Liu et al. (2015)) for gender prediction, and LFWPeople (Huang et al. (2007)) for face recognition across 29 different individuals. |
| Dataset Splits | Yes | We train LR and CNN on MNIST with 1,000 training data and 20% of data points to be forgotten, with setups identical to the aforementioned verification experiments I. We further evaluate on FMNIST with 4,000 training data and 20% of data points to be forgotten using CNN and LeNet with a total of 61,706 parameters. We conducted evaluation on ResNet-18 trained on CIFAR-10 with 50,000 samples. We conducted evaluation on ResNet-18 trained on LFW with 984 samples, for the classification of 29 facial categories. We conducted evaluation on ResNet-18 trained on CelebA with 10,000 samples. |
| Hardware Specification | Yes | The experiments were conducted on the NVIDIA GeForce RTX 4090. Our comprehensive tests were conducted on an AMD EPYC 7763 CPU @ 1.50 GHz with 64 cores under Ubuntu 20.04.6 LTS. |
| Software Dependencies | Yes | The code was implemented in PyTorch 2.0.0 and leverages CUDA Toolkit version 11.8. |
| Experiment Setup | Yes | For LR, training was performed for 15 epochs with a stepsize of 0.05 and a batch size of 32. For CNN, training was carried out for 20 epochs with a stepsize of 0.05 and a batch size of 64. Given these configurations, we separately assess the distance and correlation between the approximators a_HF, a_NS, and a_IJ at deletion rates in the set {1%, 5%, 10%, 15%, 20%, 25%, 30%}. Following the suggestion in Basu et al. (2021), a damping factor of 0.01 is added to the Hessian to ensure its invertibility when implementing NS and IJ. |
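The damping step quoted in the Experiment Setup row is a standard trick for the Newton-step (NS) and infinitesimal-jackknife (IJ) baselines: adding a small multiple of the identity to the Hessian guarantees invertibility when the Hessian is singular or ill-conditioned. A minimal NumPy sketch of this (the function name and toy matrix are illustrative, not from the paper's code):

```python
import numpy as np

def damped_hessian_inverse(hessian: np.ndarray, damping: float = 0.01) -> np.ndarray:
    """Invert the Hessian after adding `damping * I` to its diagonal.

    The damping factor (0.01 in the paper, following Basu et al. (2021))
    shifts every eigenvalue up by `damping`, so the matrix becomes
    invertible even when the raw Hessian is singular.
    """
    d = hessian.shape[0]
    return np.linalg.inv(hessian + damping * np.eye(d))

# Example: a rank-deficient Hessian (det = 0) that np.linalg.inv
# would reject becomes invertible after damping.
H = np.array([[1.0, 1.0],
              [1.0, 1.0]])
H_inv = damped_hessian_inverse(H, damping=0.01)
```

The damped inverse satisfies `(H + 0.01 * I) @ H_inv == I` up to floating-point error; the paper's HF scheme avoids forming this inverse altogether, which is the source of its time/storage savings over NS and IJ.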