A Geometric Framework for Understanding Memorization in Generative Models

Authors: Brendan Ross, Hamidreza Kamkari, Tongzi Wu, Rasa Hosseinzadeh, Zhaoyan Liu, George Stein, Jesse Cresswell, Gabriel Loaiza-Ganem

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically validate the MMH using synthetic data and image datasets up to the scale of Stable Diffusion, developing new tools for detecting and preventing generation of memorized samples in the process.
Researcher Affiliation | Industry | Layer 6 AI EMAIL
Pseudocode | No | The paper describes its methods and propositions through mathematical formulations and textual descriptions, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | To ensure the reproducibility of our experiments, we provide two codebase links. The first codebase, accessible at github.com/layer6ai-labs/dgm_geometry, contains our small-scale synthetic experiments and our CIFAR10 experiments. The second, accessible at github.com/layer6ai-labs/diffusion_memorization/, extends the work of Wen et al. (2023) to use the MMH to detect and mitigate memorization.
Open Datasets | Yes | We analyze the higher-dimensional CIFAR10 dataset (Krizhevsky & Hinton, 2009) and use two pretrained generative models... we retrieve memorized LAION (Schuhmann et al., 2022) training images identified by Webster (2023)... a mix of 2000 images sampled from LAION Aesthetics 6.5+, 2000 sampled from COCO (Lin et al., 2014), and all 251 images from the Tuxemon dataset (Tuxemon Project, 2024; Hugging Face, 2024). All datasets used in our experiments are freely available from the referenced sources and are utilized in compliance with their respective licenses.
Dataset Splits | No | The paper uses pretrained generative models on CIFAR10 and Stable Diffusion. It describes how samples were generated and selected for analysis (e.g., 'generate 50,000 images', 'take the closest 250 neighbours'), but does not specify training, validation, and test splits, since no model training falls within the scope of the paper.
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, or memory) used for running its experiments.
Software Dependencies | No | The paper mentions using the cv2 package for PNG compression but does not provide specific version numbers for it or for any other key software dependency in the experimental setup; the only dependency information given is the linked codebases.
Experiment Setup | No | The paper describes experimental methodologies for LID estimation and mitigation, including some hyperparameters for LID estimation (e.g., t0 values for FLIPD and NB, k for Local PCA). However, it does not provide a comprehensive set of hyperparameters, system-level training settings, or a detailed table of experimental configurations needed to reproduce the full setup, including the generative models themselves.
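The evaluation mix quoted under Open Datasets (2000 LAION Aesthetics 6.5+ images, 2000 COCO images, all 251 Tuxemon images) can be sketched as a simple sampling step. This is a hypothetical illustration grounded only in those counts; the function name, seed, and sampling procedure are assumptions not stated in the paper:

```python
import random

def build_eval_set(laion_ids, coco_ids, tuxemon_ids, seed=0):
    """Assemble the 4251-image evaluation mix described in the paper:
    2000 sampled from LAION Aesthetics 6.5+, 2000 sampled from COCO,
    and all 251 Tuxemon images. Seed and sampling scheme are assumptions."""
    rng = random.Random(seed)
    mix = (
        rng.sample(list(laion_ids), 2000)
        + rng.sample(list(coco_ids), 2000)
        + list(tuxemon_ids)  # the Tuxemon dataset is used in full
    )
    return mix
```

With any ID pools of sufficient size, the resulting set always has 2000 + 2000 + 251 = 4251 entries.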
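The Dataset Splits row quotes a selection step ('generate 50,000 images', 'take the closest 250 neighbours'). A generic nearest-neighbour filter of that shape can be sketched as follows; the feature space, distance metric (Euclidean here), and function name are assumptions, not the authors' implementation:

```python
import numpy as np

def closest_neighbours(generated: np.ndarray, reference: np.ndarray, k: int = 250) -> np.ndarray:
    """Indices of the k generated samples closest to any reference sample.

    generated: (n, d) array of generated-image features (assumed representation).
    reference: (m, d) array of training-image features.
    """
    # Pairwise squared Euclidean distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    g2 = (generated ** 2).sum(axis=1, keepdims=True)   # (n, 1)
    r2 = (reference ** 2).sum(axis=1)                  # (m,)
    d2 = g2 + r2 - 2.0 * generated @ reference.T       # (n, m)
    min_dist = d2.min(axis=1)                          # distance to nearest reference sample
    return np.argsort(min_dist)[:k]                    # k closest generated samples
```

For the scale quoted in the paper, `generated` would hold 50,000 rows and `k` would be 250.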
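The Software Dependencies row flags that the paper relies on the cv2 package for PNG compression without pinning a version. The underlying operation, lossless compression of image bytes, can be illustrated with the standard library's zlib (the DEFLATE codec that PNG itself builds on). This is a dependency-light stand-in, not the paper's cv2 call, and the function name and its use here are assumptions:

```python
import zlib

def compressed_ratio(raw: bytes) -> float:
    """Lossless-compression ratio of a raw byte buffer (lower = more compressible).

    Stand-in for PNG encoding via cv2: PNG's IDAT payload is DEFLATE-compressed,
    so zlib gives a comparable size signal without the OpenCV dependency.
    """
    return len(zlib.compress(raw, level=9)) / len(raw)
```

Highly regular buffers (e.g., a constant image) compress to a small fraction of their size, while noise-like buffers stay near ratio 1.0.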