Shielded Diffusion: Generating Novel and Diverse Images using Sparse Repellency

Authors: Michael Kirchhof, James Thornton, Louis Béthune, Pierre Ablin, Eugene Ndiaye, Marco Cuturi

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We now show that SPELL increases the diversity of modern text-to-image and class-conditional diffusion models (Section 5.2), with a better trade-off than other recent diversity methods (Section 5.3). We quantify the sparsity of SPELL interventions in Section 5.4. In Section 5.6, we demonstrate SPELL's scalability and a new use-case, shielded generation, by generating novel ImageNet images while shielding all 1.2 million ImageNet-1k train images. Table 1 shows that SPELL consistently increases the diversity, both in terms of recall and Vendi score, across all text-to-image and class-to-image diffusion models.
Researcher Affiliation | Collaboration | Michael Kirchhof 1,2, James Thornton 1, Louis Béthune 1, Pierre Ablin 1, Eugene Ndiaye 1, Marco Cuturi 1. 1 Apple, 2 University of Tübingen.
Pseudocode | Yes | Algorithm 1 gives high-level pseudocode for SPELL, and Algorithm 2 details how we implemented SPELL in a parallelized way in Python. Algorithm 1: SPELL added to the backwards diffusion step. Algorithm 2: Our repellency can be added to the backwards algorithm of existing diffusion models, without retraining.
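The paper's algorithms are not reproduced in this report. As a rough illustration only, a minimal sketch of what a sparse repellency intervention on the predicted clean sample could look like is given below; the function name, the boundary-projection update, and the `overcompensation` handling are our own assumptions based on the description above, not the authors' implementation.

```python
import numpy as np

def spell_repellency(x0_hat, shielded, radius, overcompensation=1.0):
    """Sparse repellency sketch: push the predicted clean sample away from
    any shielded point lying within `radius`. Samples outside every shield
    are left untouched, which is what makes the intervention sparse."""
    out = np.asarray(x0_hat, dtype=float).copy()
    for y in shielded:
        diff = out - y
        dist = np.linalg.norm(diff)
        if dist < radius:  # only intervene inside the shield
            # project onto the (possibly overcompensated) shield boundary
            target = overcompensation * radius
            out = y + diff * (target / max(dist, 1e-8))
    return out
```

For example, a prediction at distance 1 from a shielded point with radius 2 is pushed out to distance 2, while a prediction at distance 5 passes through unchanged.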
Open Source Code | No | The paper does not explicitly state that the authors' source code for SPELL is available, nor does it provide a direct link to a repository.
Open Datasets | Yes | In the class-to-image setup, we use Masked Diffusion Transformers (MDTv2) (Gao et al., 2023), EDMv2 (Karras et al., 2024), and Stable Diffusion 3 Medium (SD3) (Esser et al., 2024), three recent state-of-the-art diffusion models. We use the pretrained model checkpoints to generate 50,000 256x256 images of ImageNet-1k classes (Deng et al., 2009) without and with SPELL and compare them to the original ImageNet-1k images. In our text-to-image setup, we use SD3, Latent Diffusion (Rombach et al., 2022), and RGB-space Simple Diffusion (Hoogeboom et al., 2023) in resolution 256x256. For the latter two, we use the checkpoints of Gu et al. (2023). Details on hyperparameters are provided in Appendix D. We evaluate these models on CC12M (Changpinyo et al., 2021), a dataset of (caption, image) pairs, with captions ranging between 15 and 491 characters.
Dataset Splits | Yes | We randomly split them into a validation set of 554 captions and a test set of 5000 captions. Table 3 shows how many images belong to each caption.
Hardware Specification | Yes | The runtime is reported on a single A100-40GB GPU. Table 5: Generation times per image. Neither SPELL nor other diversity-inducing methods add considerable runtime. The runtime is dominated by the diffusion backbone. Mean ± standard deviation across 500 images, run on an NVIDIA V100 GPU.
Software Dependencies | No | The paper mentions the use of 'Python' for implementation in Appendix D, and the 'Faiss library' in Appendix I, but does not specify version numbers for any software dependencies.
Experiment Setup | Yes | D. Implementation Details and Hyperparameters: Since SPELL is a training-free post-hoc method, we use the trained checkpoints of diffusion models provided by their original authors. For EDMv2 and MDTv2, we use the hyperparameters suggested by their authors. Latent Diffusion, Simple Diffusion, and Stable Diffusion come without recommended hyperparameters, so we tune the classifier-free guidance (CFG) weight by the F-score between precision and coverage on the 554 validation captions of our CC12M split. EDMv2: CFG weight 1.2, 50 backwards steps, σmin = 0.002, σmax = 80, ρ = 7, Smin = 0, Smax = ∞, repellence radius r = 20, batch size 8. MDTv2: CFG weight 3.8, 50 backwards steps, repellence radius r = 45, batch size 2. Stable Diffusion 3: CFG weight 5.5, 28 backwards steps, repellence radius r = 200, on CC12M overcompensation 1.6 (no overcompensation on ImageNet), batch size 8. Simple Diffusion: CFG weight 5.5, 50 backwards steps, repellence radius r = 50, overcompensation 1.6, batch size 16. Latent Diffusion: CFG weight 5, 50 backwards steps, repellence radius r = 20, overcompensation 1.6, batch size 8.
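For quick reference, the per-model settings quoted above can be collected into a small Python mapping. The dictionary layout and key names are our own; only the numeric values come from the Appendix D excerpt (an overcompensation of `None` marks models where none is reported).

```python
# Hedged summary of the SPELL hyperparameters quoted from Appendix D.
SPELL_HPARAMS = {
    "EDMv2":              dict(cfg=1.2, steps=50, radius=20,  overcomp=None, batch=8),
    "MDTv2":              dict(cfg=3.8, steps=50, radius=45,  overcomp=None, batch=2),
    "Stable Diffusion 3": dict(cfg=5.5, steps=28, radius=200, overcomp=1.6,  batch=8),   # overcomp on CC12M only
    "Simple Diffusion":   dict(cfg=5.5, steps=50, radius=50,  overcomp=1.6,  batch=16),
    "Latent Diffusion":   dict(cfg=5.0, steps=50, radius=20,  overcomp=1.6,  batch=8),
}
```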