MemBench: Memorized Image Trigger Prompt Dataset for Diffusion Models

Authors: Chunsan Hong, Tae-Hyun Oh, Minhyuk Sung

TMLR 2025

Reproducibility Checklist: Variable | Result | LLM Response
Research Type: Experimental. In this work, we present MemBench, the first benchmark for evaluating image memorization mitigation methods for diffusion models. Our MemBench includes the following key features to ensure effective evaluation: (1) MemBench provides 3,000, 1,500, 309, and 1,352 memorized image trigger prompts for Stable Diffusion 1, Stable Diffusion 2, DeepFloyd IF (Shonenkov et al., 2023), and Realistic Vision (CivitAI, 2023), respectively. Through our MemBench evaluation, we revealed that existing memorization mitigation methods notably degrade the overall performance of diffusion models and need to be developed further.
Researcher Affiliation: Academia. Chunsan Hong (EMAIL, KAIST School of Electrical Engineering); Tae-Hyun Oh (EMAIL, KAIST School of Computing); Minhyuk Sung (EMAIL, KAIST School of Computing).
Pseudocode: Yes. Please refer to Algorithm 1 for details. Algorithm 1: Memorized Image Trigger Prompt Searching via Gibbs Sampling; Algorithm 2: Memorized Image Trigger Prompt Augmentation via Gibbs Sampling; Algorithm 3: Diversity Sampling.
Open Source Code: Yes. The code and datasets are available at https://github.com/chunsanHong/MemBench_code
Open Datasets: Yes. The code and datasets are available at https://github.com/chunsanHong/MemBench_code. Second, the general prompt scenario ensures that the performance of the diffusion model does not degrade when using prompts other than trigger prompts. We leverage the COCO (Lin et al., 2014) validation set as general prompts.
Dataset Splits: Yes. To ensure that memorization mitigation methods can be generally applied to diffusion models, we provide two scenarios: the memorized image trigger prompt scenario and the general prompt scenario. First, the memorized image trigger prompt scenario evaluates whether mitigation methods effectively prevent the generation of memorized images; it uses the memorized image trigger prompts we identified in Section 3. We generate 10 images for each trigger prompt and measure the Top-1 SSCD, the mean of the Top-3 SSCD, and the proportion of images with SSCD exceeding 0.5. For CLIP Score and Aesthetic Score, we report the average over all generated images. Second, the general prompt scenario ensures that the performance of the diffusion model does not degrade on prompts other than trigger prompts; we leverage the COCO (Lin et al., 2014) validation set as general prompts.
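The per-prompt aggregation described in the trigger-prompt scenario (Top-1 SSCD, mean of the Top-3 SSCD, and the fraction of images above the 0.5 threshold) can be sketched as follows. This is a minimal illustration, not the authors' code; `memorization_metrics` is a hypothetical helper, and it assumes an SSCD similarity score has already been computed for each of the 10 images generated per prompt:

```python
def memorization_metrics(sscd_scores, threshold=0.5):
    """Aggregate one prompt's per-image SSCD similarity scores into the
    three memorization statistics used in the trigger-prompt scenario."""
    ranked = sorted(sscd_scores, reverse=True)
    top1 = ranked[0]                                   # Top-1 SSCD
    top3_mean = sum(ranked[:3]) / min(3, len(ranked))  # mean of Top-3 SSCD
    # Fraction of generated images considered memorized (SSCD > threshold).
    frac_memorized = sum(s > threshold for s in sscd_scores) / len(sscd_scores)
    return top1, top3_mean, frac_memorized
```

For example, scores of [0.9, 0.7, 0.6, 0.2, 0.1] yield a Top-1 of 0.9, a Top-3 mean of about 0.733, and a memorized fraction of 0.6.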
Hardware Specification: Yes. Experiments were run on a single A100 GPU. Generating 200 candidate prompts with ZeroCap took approximately 44 hours on an A100 GPU.
Software Dependencies: No. The paper mentions BERT and CLIP models, as well as diffusion models such as Stable Diffusion and DeepFloyd IF, and specifies the DDIM scheduler for image generation. However, it does not provide version numbers for any programming languages, libraries, or frameworks used in the implementation.
Experiment Setup: Yes. Image generation is performed using the DDIM (Song et al., 2021a) scheduler with a guidance scale of 7.5 and 50 inference steps. The parameter n in the table indicates the number of words or numbers inserted. ... All other hyper-parameters followed the settings in the original paper: an Adam optimizer with a learning rate of 0.05 and a maximum of 10 steps was used for training. ... Here, the scale factor C for the beginning token s1 becomes a hyper-parameter. Table 7 lists the hyper-parameters used in memorized image trigger prompt searching with our algorithm: n is the sentence length, N the iteration count, Q the number of proposal words, K the temperature, κ the termination threshold, s the early-stop counter threshold, and T the number of returned candidate prompts.
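The generation settings quoted above map directly onto pipeline configuration in the Hugging Face diffusers library. A minimal configuration sketch, assuming diffusers is installed and using an illustrative Stable Diffusion checkpoint name (the paper does not specify which library or checkpoint the authors used):

```python
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

# Load a Stable Diffusion checkpoint and swap in a DDIM scheduler,
# matching the sampler named in the experiment setup.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # illustrative checkpoint name
    torch_dtype=torch.float16,
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

# 10 images per trigger prompt, guidance scale 7.5, 50 DDIM steps.
images = pipe(
    "a trigger prompt goes here",
    guidance_scale=7.5,
    num_inference_steps=50,
    num_images_per_prompt=10,
).images
```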