SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders
Authors: Bartosz Cywiński, Kamil Deja
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation shows that SAeUron outperforms existing approaches on the UnlearnCanvas benchmark for concept and style unlearning, and effectively eliminates nudity when evaluated with I2P. Moreover, we show that with a single SAE, we can remove multiple concepts simultaneously and that, in contrast to other methods, SAeUron mitigates the possibility of generating unwanted content under adversarial attack. Code and checkpoints are available at GitHub. |
| Researcher Affiliation | Collaboration | 1Warsaw University of Technology 2IDEAS NCBR 3IDEAS Research Institute. Correspondence to: Bartosz Cywiński <EMAIL>. |
| Pseudocode | Yes | An overview of this procedure is shown in Figure 2, with pseudocode provided in Appendix S. |
| Open Source Code | Yes | Code and checkpoints are available at GitHub. |
| Open Datasets | Yes | We evaluate our method on the recently proposed large and competitive benchmark UnlearnCanvas (Zhang et al., 2024c), which assesses unlearning effectiveness across 20 objects and 50 styles. ... Evaluation on I2P shows that our approach also effectively removes nudity. ... We train the SAE on SD-v1.4 activations gathered from 30K random captions from the COCO train 2014 set. Additionally, we add the prompts "naked man" and "naked woman" to the training set to enable the SAE to learn nudity-relevant features. |
| Dataset Splits | Yes | SAE training dataset: To ensure a fair evaluation, the SAE training set comprises text prompts distinct from those employed in the evaluation on the UnlearnCanvas benchmark. Specifically, we utilize simple one-sentence prompts (referred to as anchor prompts), which were employed by the authors of the benchmark in training the CA method (Kumari et al., 2023). For each of the 20 objects, we use 80 prompts. Additionally, to enable the SAE to learn the styles used in the benchmark, we append the postfix "in {style} style." to each prompt. ... Validation dataset for feature score calculation: To calculate feature scores during the unlearning of concept c, we collect feature activations fi(xt) at each denoising timestep t using a validation set D of anchor prompts, similar to the SAE's training set. Following the UnlearnCanvas evaluation setup, activations are gathered over 100 denoising timesteps. Despite being trained on 50 steps, SAEs generalize well to this extended range. For style unlearning, we use 20 prompts per style, and for object unlearning, 80 per object. |
| Hardware Specification | Yes | Both SAEs were trained on a single NVIDIA RTX A5000 GPU. |
| Software Dependencies | No | Optimization uses Adam (Kingma, 2014) with a learning rate of 0.0004 and a linear scheduler without warmup. We set the batch size to 4096 and unit-normalize decoder weights after each training step. No specific version numbers for software dependencies are provided. |
| Experiment Setup | Yes | Hyperparameters: Our method uses two hyperparameters, tunable for each concept c separately: the number of blocked features τc and the negative multiplier γc. For style unlearning we empirically observed that setting τc = 1 and γc = 1 yields satisfying results across all styles. For object unlearning we tune hyperparameters on the validation dataset, presenting the selected values in Appendix G. ... We train our BatchTopK sparse autoencoders with k = 32 and an expansion factor of 16. Optimization uses Adam (Kingma, 2014) with a learning rate of 0.0004 and a linear scheduler without warmup. We set the batch size to 4096 and unit-normalize decoder weights after each training step. We train the SAE on the up.1.1 object block for 5 epochs and on the up.1.2 style block for 10 epochs. |
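The training recipe quoted in the table can be sketched in code. The following is a minimal, hypothetical PyTorch sketch of a BatchTopK sparse autoencoder training step under the stated settings (k = 32, expansion factor 16, Adam with learning rate 4e-4, decoder weights unit-normalized after each step); class and function names are illustrative and do not come from the released SAeUron code.

```python
# Illustrative sketch only -- not the authors' implementation.
import torch
import torch.nn as nn

class BatchTopKSAE(nn.Module):
    def __init__(self, d_model: int, expansion: int = 16, k: int = 32):
        super().__init__()
        d_sae = d_model * expansion
        self.k = k
        self.encoder = nn.Linear(d_model, d_sae)
        self.decoder = nn.Linear(d_sae, d_model)

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        f = torch.relu(self.encoder(x))
        # BatchTopK: keep the k * batch_size largest activations across the
        # whole batch (rather than k per sample), zeroing everything else.
        n_keep = self.k * x.shape[0]
        threshold = torch.topk(f.flatten(), n_keep).values.min()
        f = torch.where(f >= threshold, f, torch.zeros_like(f))
        return self.decoder(f), f

def train_step(model: BatchTopKSAE, opt: torch.optim.Optimizer,
               x: torch.Tensor) -> float:
    recon, _ = model(x)
    loss = ((recon - x) ** 2).mean()   # plain MSE reconstruction loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    # Unit-normalize decoder columns after each step, as described above.
    with torch.no_grad():
        model.decoder.weight.div_(
            model.decoder.weight.norm(dim=0, keepdim=True))
    return loss.item()
```

In the paper's setting the activations would come from the SD-v1.4 up.1.1 or up.1.2 cross-attention blocks, batched at 4096; the sketch above omits that data pipeline.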