Low Compute Unlearning via Sparse Representations
Authors: Vedant Shah, Frederik Träuble, Ashish Malik, Hugo Larochelle, Michael Curtis Mozer, Sanjeev Arora, Yoshua Bengio, Anirudh Goyal
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the proposed technique on the problem of class unlearning using four datasets: CIFAR-10, CIFAR-100, LACUNA-100 and ImageNet-1k. We compare the proposed technique to SCRUB, a state-of-the-art approach which uses knowledge distillation for unlearning. Across all four datasets, the proposed technique performs as well as, if not better than, SCRUB while incurring almost no computational cost. |
| Researcher Affiliation | Academia | Vedant Shah (Mila, Université de Montréal); Frederik Träuble (MPI, Tübingen); Ashish Malik (University of Oregon); Hugo Larochelle (Mila, Université de Montréal); Michael Mozer (University of Colorado, Boulder); Sanjeev Arora (Princeton University); Yoshua Bengio (Mila, Université de Montréal); Anirudh Goyal (Mila) |
| Pseudocode | Yes | Algorithm 1: Unlearning via Activations; Algorithm 2: Unlearning via Examples |
| Open Source Code | No | The paper mentions a third-party library: "We use the fvcore library for computing the number of FLOPs required during the forward passes" (https://github.com/facebookresearch/fvcore/). However, it does not provide any statement or link to the source code for the methodology described in this paper. |
| Open Datasets | Yes | We validate the proposed methods using experiments across four base datasets: CIFAR-10 with 10 distinct classes, CIFAR-100 (Krizhevsky et al., 2009) with 100 distinct classes, LACUNA-100 (Golatkar et al., 2020a) with 100 distinct classes and ImageNet-1k (Russakovsky et al., 2015) with 1000 distinct classes. LACUNA-100 is derived from VGG-Faces (Cao et al., 2018) by sampling 100 different celebrities and sampling 500 images per celebrity, out of which 400 are used as training data and the rest are used as test images. |
| Dataset Splits | Yes | Let D_train = {x_i, y_i}_{i=1}^N be a training dataset and D_test be the corresponding test dataset. In our experiments, we consider the setting of class unlearning, wherein we aim to unlearn a class c from a model trained with a multiclass classification objective on D_train. c is called the forget class or the forget set. Given c, we obtain D_train^forget ⊂ D_train such that D_train^forget = {(x, y) ∈ D_train \| y = c}. The complement of D_train^forget is D_train^retain, i.e., the subset of D_train that we wish to retain. Thus D_train^retain ∪ D_train^forget = D_train. Similarly, from D_test, we have D_test^forget = {(x, y) ∈ D_test \| y = c} and its complement D_test^retain. We refer to D_train^retain and D_test^retain as the retain set training and test data; and D_train^forget and D_test^forget as the forget set training and test data, respectively. Table 3: Performance of the models on different sets of data after the initial training on the four datasets. We use two kinds of models: (a) models having a Discrete KV Bottleneck (DKVB), which are used for the proposed methods, and (b) models where the DKVB and the decoder are replaced by a linear layer; these are used for the baseline. We wish to reduce the accuracy of these models on D_test^forget to 0% while maintaining the accuracy on D_test^retain. Experimental Setup: We perform the experiment for CIFAR-10 with a ViT-B/32 backbone. We divide the dataset into training data (D_Train), validation data (D_Val) and test data (D_Test). Training data consists of 4000 examples per class; validation and test data consist of 1000 examples per class. |
| Hardware Specification | Yes | We perform all of our experiments on a 48GB RTX8000 GPU. |
| Software Dependencies | No | The paper mentions using a "CLIP (Radford et al., 2021) pretrained ViT-B/32" and loading "torchvision.models.ResNet50_Weights". It also references the fvcore library. However, specific version numbers for software components like PyTorch, Python, or the aforementioned libraries are not provided. |
| Experiment Setup | Yes | We then train both model architectures on the full training sets of each dataset. Since the backbone is frozen, for the baseline models, only the weights of the linear layer are tuned during initial training (and later unlearning). Since we use only one linear layer, we do not do any pre-training (beyond the backbone), unlike in previous works (Kurmanji et al., 2023; Golatkar et al., 2020a;b). Table 3 shows the performance of these trained models on the train and test splits of the complete datasets. Tables 11, 12, 13, 14 provide hyperparameter details: Table 11: Hyperparameters used for training the base DKVB models; Table 12: Hyperparameters used for training the baseline models; Table 13: Hyperparameters for SCRUB + Linear Layer Experiments shown in Section 5.2.1; Table 14: Hyperparameters used for re-training experiments. |
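The class-unlearning split described in the Dataset Splits row (partitioning D_train into a forget set with label c and its retain-set complement) can be sketched in a few lines. This is a hypothetical helper for illustration — the paper does not release code, and the function name `class_unlearning_split` and the toy dataset are assumptions:

```python
def class_unlearning_split(dataset, forget_class):
    """Partition (x, y) pairs into the forget subset (label == forget_class)
    and the retain subset (its complement), as defined in the paper:
    D^forget = {(x, y) in D | y = c}, D^retain = D \\ D^forget."""
    forget = [(x, y) for x, y in dataset if y == forget_class]
    retain = [(x, y) for x, y in dataset if y != forget_class]
    return forget, retain

# Toy example: a 3-class dataset, forgetting class 2.
toy_train = [("img0", 0), ("img1", 1), ("img2", 2), ("img3", 2), ("img4", 0)]
forget_set, retain_set = class_unlearning_split(toy_train, forget_class=2)
print(len(forget_set), len(retain_set))  # 2 3
```

By construction the two subsets are disjoint and their union recovers D_train, matching the identity D^retain ∪ D^forget = D_train quoted above; the same split applied to D_test yields the forget/retain test sets used to measure the target 0% forget-set accuracy.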