Sparse-to-Sparse Training of Diffusion Models

Authors: Inês Cardoso Oliveira, Decebal Constantin Mocanu, Luis A. Leiva

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that sparse DMs are able to match and often outperform their Dense counterparts, while substantially reducing the number of trainable parameters and FLOPs. We also identify safe and effective values to perform sparse-to-sparse training of DMs.
Researcher Affiliation | Academia | Inês Cardoso Oliveira (EMAIL), University of Luxembourg; Decebal Constantin Mocanu (EMAIL), University of Luxembourg; Luis A. Leiva (EMAIL), University of Luxembourg
Pseudocode | Yes | Algorithm 1: Static-DM; Algorithm 2: RigL-DM and MagRan-DM
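For context (the paper's actual pseudocode is not reproduced in this report): RigL-style sparse-to-sparse training periodically drops the smallest-magnitude active weights and regrows an equal number of inactive connections, chosen by gradient magnitude in RigL or at random in magnitude/random variants. A minimal NumPy sketch of one such connectivity update; the function name and drop fraction are illustrative assumptions, not values from the paper:

```python
import numpy as np

def prune_and_regrow(weights, grads, drop_frac=0.3):
    """One RigL-style connectivity update on a sparse weight matrix.

    Drops the `drop_frac` smallest-magnitude active weights, then regrows
    the same number of currently inactive connections at the positions
    with the largest gradient magnitude (the RigL growth criterion).
    Returns the updated weights and the new binary connectivity mask.
    """
    mask = weights != 0
    n_drop = int(drop_frac * mask.sum())

    # Prune: zero out the smallest-magnitude active weights.
    active = np.flatnonzero(mask)
    drop_idx = active[np.argsort(np.abs(weights.flat[active]))[:n_drop]]
    weights.flat[drop_idx] = 0
    mask.flat[drop_idx] = False

    # Regrow: activate the inactive positions with the largest gradients.
    # Newly grown weights start at zero and are trained from there.
    inactive = np.flatnonzero(~mask)
    grow_idx = inactive[np.argsort(-np.abs(grads.flat[inactive]))[:n_drop]]
    mask.flat[grow_idx] = True

    return weights, mask
```

Because the drop and growth counts are equal, the overall sparsity level stays fixed throughout training, which is what distinguishes sparse-to-sparse methods from pruning a dense model.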
Open Source Code | Yes | Open Science: Our code and models are available at https://github.com/iclbo/sparse_to_sparse_diffusion
Open Datasets | Yes | We evaluate on the LSUN-Bedrooms (Yu et al., 2015), CelebA-HQ (Karras et al., 2018), and Imagenette (Howard, 2019) datasets. [...] We evaluate it on KanjiVG, QuickDraw (Ha & Eck, 2018), and VMNIST (Das et al., 2022).
Dataset Splits | Yes | Due to computing limitations, we use 12500/500 training/validation images for CelebA-HQ and 10598/2500 images for LSUN-Bedrooms.
Hardware Specification | Yes | The experiments presented in this paper were carried out using the HPC facilities of the University of Luxembourg (https://hpc.uni.lu) and Luxembourg's national supercomputer MeluXina. The authors gratefully acknowledge the ULHPC and LuxProvide teams for their expert support. [...] equipped with NVIDIA Tesla V100 SXM2 and A100 GPUs.
Software Dependencies | No | The paper mentions the AdamW optimizer and the torch-fidelity Python package but does not provide version numbers for any software dependencies.
Experiment Setup | Yes | Latent Diffusion on LSUN-Bedrooms, CelebA-HQ, and Imagenette: batch size 12, AdamW optimizer with weight decay 1e-2, static learning rate 2.4e-5 (2.0e-6 for CelebA-HQ), 150 epochs of training; 1000 denoising steps (T), linear noise schedule from 0.0015 to 0.0195, and sinusoidal embeddings for the timestep. ChiroDiff on QuickDraw, KanjiVG, and VMNIST: batch size 128, AdamW optimizer with weight decay 1e-2, static learning rate 1e-3, 600 epochs of training; 1000 denoising steps (T), linear noise schedule from 1e-4 to 2e-2, and random Fourier features for the timestep embedding.
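As a sanity check on the reported latent-diffusion schedule (T = 1000 denoising steps, linear betas from 0.0015 to 0.0195), the settings above can be sketched in NumPy. The row only says "sinusoidal embeddings for the timestep", so the standard Transformer-style embedding formula is assumed here; the embedding dimension of 128 is likewise an illustrative assumption:

```python
import numpy as np

# Linear beta schedule as reported: 1000 steps from 0.0015 to 0.0195.
T = 1000
betas = np.linspace(0.0015, 0.0195, T)

# Cumulative product of (1 - beta), used to noise samples in closed form;
# it decays monotonically toward zero across the 1000 steps.
alphas_cumprod = np.cumprod(1.0 - betas)

def timestep_embedding(t, dim=128):
    """Standard sinusoidal timestep embedding (assumed form, dim=128)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

emb = timestep_embedding(500)  # embedding for the midpoint timestep
```

The ChiroDiff runs would use the same construction with `np.linspace(1e-4, 2e-2, T)` and random Fourier features in place of the fixed sinusoidal frequencies.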