SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders
Authors: Bartosz Cywiński, Kamil Deja
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation shows that SAeUron outperforms existing approaches on the UnlearnCanvas benchmark for concept and style unlearning, and effectively eliminates nudity when evaluated with I2P. Moreover, we show that with a single SAE, we can remove multiple concepts simultaneously and that, in contrast to other methods, SAeUron mitigates the possibility of generating unwanted content under adversarial attack. Code and checkpoints are available at GitHub. |
| Researcher Affiliation | Collaboration | 1Warsaw University of Technology 2IDEAS NCBR 3IDEAS Research Institute. Correspondence to: Bartosz Cywiński <EMAIL>. |
| Pseudocode | Yes | An overview of this procedure is shown in Figure 2, with pseudocode provided in Appendix S. |
| Open Source Code | Yes | Code and checkpoints are available at GitHub. |
| Open Datasets | Yes | We evaluate our method on the recently proposed large and competitive benchmark UnlearnCanvas (Zhang et al., 2024c), which assesses unlearning effectiveness across 20 objects and 50 styles. ... Evaluation on I2P shows that our approach also effectively removes nudity. ... We train the SAE on SD-v1.4 activations gathered from 30K random captions from the COCO train 2014 set. Additionally, we add the prompts "naked man" and "naked woman" to the training set to enable the SAE to learn nudity-relevant features. |
| Dataset Splits | Yes | SAE training dataset: To ensure a fair evaluation, the SAE training set comprises text prompts distinct from those employed in the evaluation on the UnlearnCanvas benchmark. Specifically, we utilize simple one-sentence prompts (referred to as anchor prompts), which were employed by the authors of the benchmark in training the CA method (Kumari et al., 2023). For each of the 20 objects, we use 80 prompts. Additionally, to enable the SAE to learn the styles used in the benchmark, we append the postfix "in {style} style." to each prompt. ... Validation dataset for feature score calculation: To calculate feature scores during the unlearning of concept c, we collect feature activations fi(xt) at each denoising timestep t using a validation set D of anchor prompts, similar to the SAE's training set. Following the UnlearnCanvas evaluation setup, activations are gathered over 100 denoising timesteps. Despite being trained on 50 steps, SAEs generalize well to this extended range. For style unlearning, we use 20 prompts per style, and for object unlearning, 80 per object. |
| Hardware Specification | Yes | Both SAEs were trained on a single NVIDIA RTX A5000 GPU. |
| Software Dependencies | No | Optimization uses Adam (Kingma, 2014) with a learning rate of 0.0004 and a linear scheduler without warmup. We set the batch size to 4096 and unit-normalize decoder weights after each training step. No specific version numbers for software dependencies are provided. |
| Experiment Setup | Yes | Hyperparameters: Our method uses two hyperparameters, tunable for each concept c separately: the number of blocked features τc and the negative multiplier γc. For style unlearning we empirically observed that setting τc = 1 and γc = 1 yields satisfying results across all styles. For object unlearning we tune hyperparameters on the validation dataset, presenting the selected values in Appendix G. ... We train our BatchTopK sparse autoencoders with k = 32 and an expansion factor of 16. Optimization uses Adam (Kingma, 2014) with a learning rate of 0.0004 and a linear scheduler without warmup. We set the batch size to 4096 and unit-normalize decoder weights after each training step. We train the SAE on the up.1.1 object block for 5 epochs and on the up.1.2 style block for 10 epochs. |
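The training recipe quoted in the table can be sketched in code. The following is a minimal, hypothetical PyTorch sketch of a BatchTopK sparse autoencoder training step under the stated settings (k = 32, expansion factor 16, Adam with learning rate 4e-4, decoder weights unit-normalized after each step); class and function names are illustrative and do not come from the released SAeUron code.

```python
# Illustrative sketch only -- not the authors' implementation.
import torch
import torch.nn as nn

class BatchTopKSAE(nn.Module):
    def __init__(self, d_model: int, expansion: int = 16, k: int = 32):
        super().__init__()
        d_sae = d_model * expansion
        self.k = k
        self.encoder = nn.Linear(d_model, d_sae)
        self.decoder = nn.Linear(d_sae, d_model)

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        f = torch.relu(self.encoder(x))
        # BatchTopK: keep the k * batch_size largest activations across the
        # whole batch (rather than k per sample), zeroing everything else.
        n_keep = self.k * x.shape[0]
        threshold = torch.topk(f.flatten(), n_keep).values.min()
        f = torch.where(f >= threshold, f, torch.zeros_like(f))
        return self.decoder(f), f

def train_step(model: BatchTopKSAE, opt: torch.optim.Optimizer,
               x: torch.Tensor) -> float:
    recon, _ = model(x)
    loss = ((recon - x) ** 2).mean()   # plain MSE reconstruction loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    # Unit-normalize decoder columns after each step, as described above.
    with torch.no_grad():
        model.decoder.weight.div_(
            model.decoder.weight.norm(dim=0, keepdim=True))
    return loss.item()
```

In the paper's setting the activations would come from the SD-v1.4 up.1.1 or up.1.2 cross-attention blocks, batched at 4096; the sketch above omits that data pipeline.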