Exploring Local Memorization in Diffusion Models via Bright Ending Attention

Authors: Chen Chen, Daochang Liu, Mubarak Shah, Chang Xu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that this integration significantly enhances the performance of these existing tasks by narrowing the gap caused by local memorization, further underscoring the contribution of BE. Our results not only validate the successful execution of the new localization task but also establish new state-of-the-art performance across all existing tasks, underscoring the significance of the BE phenomenon.
Researcher Affiliation | Academia | 1 School of Computer Science, Faculty of Engineering, The University of Sydney, Australia; 2 School of Physics, Mathematics and Computing, The University of Western Australia, Australia; 3 Center for Research in Computer Vision, University of Central Florida, USA. {cche0711@uni., c.xu@}sydney.edu.au
Pseudocode | No | The paper describes methodologies, but it does not contain any clearly labeled pseudocode blocks or algorithms.
Open Source Code | No | The paper does not provide a direct link to a source-code repository, an explicit statement of code release, or any mention of code in supplementary materials for the methodology described.
Open Datasets | Yes | We conducted experiments on Stable Diffusion v1-4. We adhered to the baseline prompt dataset (Wen et al., 2024), using 500 prompts each from Lexica, LAION, COCO, and random captions as non-memorized prompts. We used the dataset organized by Webster (2023) for memorized prompts. Similar findings were observed for unconditional generations in pretrained DDPMs on CIFAR-10 (Krizhevsky, 2009), as well as DDPMs trained on smaller datasets like CelebA (Liu et al., 2015) and Oxford Flowers (Nilsback & Zisserman, 2008).
Dataset Splits | No | The paper states: 'using 500 prompts each from Lexica, LAION, COCO, and random captions as non-memorized prompts. We used the dataset organized by Webster (2023) for memorized prompts. However, since not all 500 prompts in Webster (2023)'s dataset are prone to memorization, we selected 300 memorized prompts for our experiments. For each memorized and non-memorized prompt, we generated 16 images.' This describes the selection of prompts and the generation of images for evaluation, but not specific training/test/validation splits of the underlying datasets (LAION, COCO, etc.) used for model training.
Hardware Specification | Yes | The inference process takes about 2 seconds per generation using an RTX 4090. While this integration does require an additional inference pass (e.g., 50 denoising steps, approximately 2 seconds on an RTX 4090) to extract the BE mask, the trade-off is well justified by the method's significant contributions to addressing local memorization challenges.
Software Dependencies | No | The paper mentions 'Stable Diffusion v1-4' as the model used and refers to a 'baseline method from Wen et al. (2024)', but does not provide specific version numbers for software libraries, programming languages, or other ancillary software components.
Experiment Setup | Yes | Following previous works, we conducted experiments on Stable Diffusion v1-4. We adhered to the baseline prompt dataset (Wen et al., 2024), using 500 prompts each from Lexica, LAION, COCO, and random captions as non-memorized prompts. We used the dataset organized by Webster (2023) for memorized prompts. However, since not all 500 prompts in Webster (2023)'s dataset are prone to memorization, we selected 300 memorized prompts for our experiments. For each memorized and non-memorized prompt, we generated 16 images. We follow the baseline methodology from Wen et al. (2024) and evaluate performance using the area under the curve (AUC) of the receiver operating characteristic (ROC) curve, the True Positive Rate at 1% False Positive Rate (T@1%F), and the F1 score, and report the detection performance during the 1, 10, and 50 inference steps. Implementationally, we experimented with the cross-attention maps on different layers of the U-Net and found that averaging the first two 64-pixel downsampling layers most effectively extracts the local memorization mask.
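The metrics quoted above (AUC, T@1%F, F1) are standard binary-detection scores over per-prompt memorization scores. A minimal, dependency-free sketch of how they could be computed — the function names and the synthetic scores are illustrative assumptions, not from the paper:

```python
def auc(pos, neg):
    # Mann-Whitney estimate of ROC AUC: probability that a memorized
    # prompt's score outranks a non-memorized one (ties count half).
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def tpr_at_fpr(pos, neg, fpr=0.01):
    # T@1%F: pick the threshold that admits at most `fpr` of the
    # non-memorized prompts, then measure recall on memorized ones.
    k = int(len(neg) * fpr)                 # allowed false positives
    thresh = sorted(neg, reverse=True)[k]
    return sum(p > thresh for p in pos) / len(pos)

def f1(pos, neg, thresh):
    # F1 at a fixed decision threshold.
    tp = sum(p > thresh for p in pos)
    fp = sum(n > thresh for n in neg)
    fn = len(pos) - tp
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Illustrative, perfectly separated scores (hypothetical values).
memorized = [0.9, 0.8, 0.7, 0.6]
non_memorized = [0.4, 0.3, 0.2, 0.1]
print(auc(memorized, non_memorized))            # 1.0 on separable data
print(tpr_at_fpr(memorized, non_memorized))     # 1.0 on separable data
print(f1(memorized, non_memorized, 0.5))        # 1.0 on separable data
```

In the paper's setting, `pos` would hold detection scores for the 300 memorized prompts and `neg` for the 2,000 non-memorized ones, recomputed at the 1-, 10-, and 50-step checkpoints.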