Shielded Diffusion: Generating Novel and Diverse Images using Sparse Repellency

Authors: Michael Kirchhof, James Thornton, Louis Béthune, Pierre Ablin, Eugene Ndiaye, Marco Cuturi

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We now show that SPELL increases the diversity of modern text-to-image and class-conditional diffusion models (Section 5.2), with a better trade-off than other recent diversity methods (Section 5.3). We quantify the sparsity of SPELL interventions in Section 5.4. In Section 5.6, we demonstrate SPELL's scalability and a new use-case, shielded generation, by generating novel ImageNet images while shielding all 1.2 million ImageNet-1k train images. Table 1 shows that SPELL consistently increases the diversity, both in terms of recall and Vendi score, across all text-to-image and class-to-image diffusion models.
Researcher Affiliation | Collaboration | Michael Kirchhof 1,2, James Thornton 1, Louis Béthune 1, Pierre Ablin 1, Eugene Ndiaye 1, Marco Cuturi 1. 1 Apple, 2 University of Tübingen.
Pseudocode | Yes | Algorithm 1 gives high-level pseudocode for SPELL, and Algorithm 2 details how we implemented SPELL in a parallelized way in Python. Algorithm 1: SPELL added to the backwards diffusion step. Algorithm 2: Our repellency can be added to the backwards algorithm of existing diffusion models, without retraining.
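The paper's algorithms are not reproduced in this report. As a rough illustration only, a minimal sketch of what a sparse repellency intervention on the predicted clean sample could look like is given below; the function name, the boundary-projection update, and the `overcompensation` handling are our own assumptions based on the description above, not the authors' implementation.

```python
import numpy as np

def spell_repellency(x0_hat, shielded, radius, overcompensation=1.0):
    """Sparse repellency sketch: push the predicted clean sample away from
    any shielded point lying within `radius`. Samples outside every shield
    are left untouched, which is what makes the intervention sparse."""
    out = np.asarray(x0_hat, dtype=float).copy()
    for y in shielded:
        diff = out - y
        dist = np.linalg.norm(diff)
        if dist < radius:  # only intervene inside the shield
            # project onto the (possibly overcompensated) shield boundary
            target = overcompensation * radius
            out = y + diff * (target / max(dist, 1e-8))
    return out
```

For example, a prediction at distance 1 from a shielded point with radius 2 is pushed out to distance 2, while a prediction at distance 5 passes through unchanged.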
Open Source Code | No | The paper does not explicitly state that the authors' source code for SPELL is available, nor does it provide a direct link to a repository.
Open Datasets | Yes | In the class-to-image setup, we use Masked Diffusion Transformers (MDTv2) (Gao et al., 2023), EDMv2 (Karras et al., 2024), and Stable Diffusion 3 Medium (SD3) (Esser et al., 2024), three recent state-of-the-art diffusion models. We use the pretrained model checkpoints to generate 50,000 256x256 images of ImageNet-1k classes (Deng et al., 2009) without and with SPELL and compare them to the original ImageNet-1k images. In our text-to-image setup, we use SD3, Latent Diffusion (Rombach et al., 2022), and RGB-space Simple Diffusion (Hoogeboom et al., 2023) in resolution 256x256. For the latter two, we use the checkpoints of Gu et al. (2023). Details on hyperparameters are provided in Appendix D. We evaluate these models on CC12M (Changpinyo et al., 2021), a dataset of (caption, image) pairs, with captions ranging between 15 and 491 characters.
Dataset Splits | Yes | We randomly split them into a validation set of 554 captions and a test set of 5000 captions. Table 3 shows how many images belong to each caption.
Hardware Specification | Yes | The runtime is reported on a single A100-40GB GPU. Table 5: Generation times per image. Neither SPELL nor other diversity-inducing methods add considerable runtime. The runtime is dominated by the diffusion backbone. Mean ± standard deviation across 500 images, run on an NVIDIA V100 GPU.
Software Dependencies | No | The paper mentions the use of 'Python' for implementation in Appendix D, and the 'Faiss library' in Appendix I, but does not specify version numbers for any software dependencies.
Experiment Setup | Yes | D. Implementation Details and Hyperparameters: Since SPELL is a training-free post-hoc method, we use the trained checkpoints of diffusion models provided by their original authors. For EDMv2 and MDTv2, we use the hyperparameters suggested by their authors. Latent Diffusion, Simple Diffusion, and Stable Diffusion come without recommended hyperparameters, so we tune the classifier-free guidance (CFG) weight by the F-score between precision and coverage on the 554 validation captions of our CC12M split. EDMv2: CFG weight 1.2, 50 backwards steps, σmin = 0.002, σmax = 80, ρ = 7, Smin = 0, Smax = ∞, repellence radius r = 20, batch size 8. MDTv2: CFG weight 3.8, 50 backwards steps, repellence radius r = 45, batch size 2. Stable Diffusion 3: CFG weight 5.5, 28 backwards steps, repellence radius r = 200, on CC12M overcompensation 1.6 (no overcompensation on ImageNet), batch size 8. Simple Diffusion: CFG weight 5.5, 50 backwards steps, repellence radius r = 50, overcompensation 1.6, batch size 16. Latent Diffusion: CFG weight 5, 50 backwards steps, repellence radius r = 20, overcompensation 1.6, batch size 8.
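For quick reference, the per-model settings quoted above can be collected into a small Python mapping. The dictionary layout and key names are our own; only the numeric values come from the Appendix D excerpt (an overcompensation of `None` marks models where none is reported).

```python
# Hedged summary of the SPELL hyperparameters quoted from Appendix D.
SPELL_HPARAMS = {
    "EDMv2":              dict(cfg=1.2, steps=50, radius=20,  overcomp=None, batch=8),
    "MDTv2":              dict(cfg=3.8, steps=50, radius=45,  overcomp=None, batch=2),
    "Stable Diffusion 3": dict(cfg=5.5, steps=28, radius=200, overcomp=1.6,  batch=8),   # overcomp on CC12M only
    "Simple Diffusion":   dict(cfg=5.5, steps=50, radius=50,  overcomp=1.6,  batch=16),
    "Latent Diffusion":   dict(cfg=5.0, steps=50, radius=20,  overcomp=1.6,  batch=8),
}
```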