A Geometric Framework for Understanding Memorization in Generative Models
Authors: Brendan Ross, Hamidreza Kamkari, Tongzi Wu, Rasa Hosseinzadeh, Zhaoyan Liu, George Stein, Jesse Cresswell, Gabriel Loaiza-Ganem
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically validate the MMH using synthetic data and image datasets up to the scale of Stable Diffusion, developing new tools for detecting and preventing generation of memorized samples in the process. |
| Researcher Affiliation | Industry | Layer 6 AI |
| Pseudocode | No | The paper describes methods and propositions through mathematical formulations and textual descriptions, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | To ensure the reproducibility of our experiments, we provide two codebase links. The first codebase, accessible at github.com/layer6ai-labs/dgm_geometry, contains our small-scale synthetic experiments and our CIFAR10 experiments. The second, accessible at github.com/layer6ai-labs/diffusion_memorization/, extends the work of Wen et al. (2023) to use the MMH to detect and mitigate memorization. |
| Open Datasets | Yes | We analyze the higher-dimensional CIFAR10 dataset (Krizhevsky & Hinton, 2009) and use two pretrained generative models... we retrieve memorized LAION (Schuhmann et al., 2022) training images identified by Webster (2023)... a mix of 2000 images sampled from LAION Aesthetics 6.5+, 2000 sampled from COCO (Lin et al., 2014), and all 251 images from the Tuxemon dataset (Tuxemon Project, 2024; Hugging Face, 2024). All datasets used in our experiments are freely available from the referenced sources and are utilized in compliance with their respective licenses. |
| Dataset Splits | No | The paper states that it uses pretrained generative models on CIFAR10 and Stable Diffusion. It describes how samples were generated and selected for analysis (e.g., 'generate 50,000 images', 'take the closest 250 neighbours'), but it does not specify training, validation, or test splits, since no model training is reproduced within the scope of this paper. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, or memory) used for running its experiments. |
| Software Dependencies | No | The paper mentions using the 'cv2 package' for PNG compression but does not provide specific version numbers for it or any other key software dependencies used in their experimental setup, other than linking to codebases. |
| Experiment Setup | No | The paper describes experimental methodologies for LID estimation and mitigation approaches, including some hyperparameters for LID estimation (e.g., t0 values for FLIPD and NB, k for Local PCA). However, it does not provide a comprehensive set of hyperparameters, system-level training settings for the generative models themselves, or a detailed table of experimental configurations needed to reproduce the full setup. |
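The "Software Dependencies" and "Hardware Specification" rows above flag missing version and environment information. As a minimal sketch of how such information could be captured alongside a codebase, the snippet below records the Python version, platform string, and installed versions of selected packages using only the standard library; the package names passed in (e.g., `opencv-python`, which provides the `cv2` module the paper mentions) are illustrative assumptions, not a list taken from the paper.

```python
import importlib.metadata as md
import json
import platform
import sys


def environment_report(packages):
    """Collect Python/platform info and installed versions of the given packages.

    Packages that are not installed are recorded as None rather than raising,
    so the report can be generated on any machine.
    """
    report = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": {},
    }
    for name in packages:
        try:
            report["packages"][name] = md.version(name)
        except md.PackageNotFoundError:
            report["packages"][name] = None  # not installed in this environment
    return report


if __name__ == "__main__":
    # 'opencv-python' (the cv2 package) is mentioned in the paper; 'torch' is
    # a hypothetical addition for illustration only.
    print(json.dumps(environment_report(["opencv-python", "torch"]), indent=2))
```

Dumping such a report to a JSON file in the repository would pin exactly the version details this review finds missing.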