Sparse-to-Sparse Training of Diffusion Models
Authors: Inês Cardoso Oliveira, Decebal Constantin Mocanu, Luis A. Leiva
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that sparse DMs are able to match and often outperform their Dense counterparts, while substantially reducing the number of trainable parameters and FLOPs. We also identify safe and effective values to perform sparse-to-sparse training of DMs. |
| Researcher Affiliation | Academia | Inês Cardoso Oliveira, EMAIL, University of Luxembourg; Decebal Constantin Mocanu, EMAIL, University of Luxembourg; Luis A. Leiva, EMAIL, University of Luxembourg |
| Pseudocode | Yes | Algorithm 1 Static-DM; Algorithm 2 RigL-DM and MagRan-DM (a minimal prune-and-grow sketch appears after this table) |
| Open Source Code | Yes | Open Science: Our code and models are available at https://github.com/iclbo/sparse_to_sparse_diffusion |
| Open Datasets | Yes | We evaluate on the LSUN-Bedrooms (Yu et al., 2015), CelebA-HQ (Karras et al., 2018) and Imagenette (Howard, 2019) datasets. [...] We evaluate it on KanjiVG, QuickDraw (Ha & Eck, 2018), and VMNIST (Das et al., 2022). |
| Dataset Splits | Yes | Due to computing limitations, we use 12500/500 training/validation images for CelebA-HQ and 10598/2500 images for LSUN-Bedrooms. |
| Hardware Specification | Yes | The experiments presented in this paper were carried out using the HPC facilities of the University of Luxembourg (https://hpc.uni.lu) and Luxembourg's national supercomputer MeluXina. The authors gratefully acknowledge the ULHPC and LuxProvide teams for their expert support. [...] equipped with NVIDIA Tesla V100 SXM2 and A100 GPUs. |
| Software Dependencies | No | The paper mentions the 'AdamW optimizer' and the 'torch-fidelity Python package' but does not provide version numbers for any software dependencies (a hedged torch-fidelity usage example follows the table). |
| Experiment Setup | Yes | Latent Diffusion on LSUN-Bedrooms, CelebA-HQ, and Imagenette: batch size 12, AdamW optimizer with weight decay 1e-2, static learning rate 2.4e-5 (2.0e-06 for CelebA-HQ), 150 training epochs, 1000 denoising steps (T), linear noise schedule from 0.0015 to 0.0195, and sinusoidal timestep embeddings. ChiroDiff on QuickDraw, KanjiVG, and VMNIST: batch size 128, AdamW optimizer with weight decay 1e-2, static learning rate 1e-3, 600 training epochs, 1000 denoising steps (T), linear noise schedule from 1e-4 to 2e-2, and random Fourier features for the timestep embedding. (A minimal schedule sketch follows the table.) |
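
The pseudocode row above names three schemes: Static-DM (a fixed random mask) and the dynamic RigL-DM and MagRan-DM, which periodically rewire a fixed-sparsity mask during training. Below is a minimal, hypothetical PyTorch sketch of one RigL-style prune-and-grow step; the function name and the `prune_frac` parameter are illustrative and do not reproduce the authors' implementation.

```python
import torch

def rigl_mask_update(weight, grad, mask, prune_frac=0.3):
    """One RigL-style prune-and-grow step for a single layer (sketch).

    Drops the smallest-magnitude active weights and regrows the same
    number of connections where the dense gradient is largest, so the
    total number of active parameters stays constant.
    """
    n_active = int(mask.sum().item())
    n_update = int(prune_frac * n_active)
    if n_update == 0:
        return mask

    # Prune: drop the n_update active weights with smallest magnitude.
    scores = weight.abs().flatten().clone()
    scores[mask.flatten() == 0] = float("inf")  # inactive slots can't be dropped
    drop_idx = torch.topk(scores, n_update, largest=False).indices

    new_mask = mask.flatten().clone()
    new_mask[drop_idx] = 0.0

    # Grow: activate the n_update inactive slots with the largest
    # gradient magnitude, excluding connections dropped in this step.
    gains = grad.abs().flatten().clone()
    gains[new_mask == 1] = -float("inf")
    gains[drop_idx] = -float("inf")
    grow_idx = torch.topk(gains, n_update, largest=True).indices
    new_mask[grow_idx] = 1.0

    # RigL initialises regrown weights at zero.
    weight.data.view(-1)[grow_idx] = 0.0
    return new_mask.view_as(mask)
```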
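
The experiment-setup row reports, for the latent-diffusion runs, T = 1000 denoising steps with a linear noise schedule from 0.0015 to 0.0195 and sinusoidal timestep embeddings. Here is a minimal sketch under standard DDPM conventions; the names are illustrative, not taken from the paper's code.

```python
import math
import torch

# Reported schedule: 1000 denoising steps, betas linearly spaced
# from 0.0015 to 0.0195 (values from the table above).
T = 1000
betas = torch.linspace(0.0015, 0.0195, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def sinusoidal_embedding(t, dim=128):
    """Map integer timesteps t of shape [B] to [B, dim] sinusoidal features."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / (half - 1))
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)
```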
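
Since the software-dependencies row notes the torch-fidelity package without a version, here is a hedged example of how FID is typically computed with that package; the directory paths are placeholders, and the exact settings the authors used are not specified in the paper.

```python
from torch_fidelity import calculate_metrics

# Placeholder paths: directories of generated and reference images.
metrics = calculate_metrics(
    input1="samples/",
    input2="real_images/",
    fid=True,
)
print(metrics["frechet_inception_distance"])
```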