A Mixture-Based Framework for Guiding Diffusion Models

Authors: Yazid Janati, Badr Moufad, Mehdi Abou El Qassime, Alain Oliviero Durmus, Eric Moulines, Jimmy Olsson

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate our approach through extensive experiments on image inverse problems, utilizing both pixel- and latent-space diffusion priors, as well as on source separation with an audio diffusion model. MGDM demonstrates strong empirical performance across 10 image-restoration tasks involving both pixel-space and latent-space diffusion models, as well as in musical source separation, even matching the performance of supervised methods. We evaluate MGDM on image inverse problems using both pixel-space and latent-space diffusion, as well as on musical source separation tasks. For the pixel-space diffusion and the audio diffusion model, we compare MGDM against seven competitors. We report the LPIPS metric (Zhang et al., 2018) in Tables 1 and 2 and defer the complete tables with FID, PSNR and SSIM alongside 95% confidence intervals to Table 6, Table 7, and Table 8. The SI-SDRi metric measures the improvement between the original audio source x_i and the generated source x̂_i, relative to the mixture baseline y.
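For reference, the SI-SDRi metric quoted above can be sketched in a few lines of pure Python. The helper names (`si_sdr`, `si_sdri`) are illustrative, not taken from the paper's codebase; this follows the standard scale-invariant SDR definition of Le Roux et al. (2019).

```python
import math

def si_sdr(est, ref, eps=1e-12):
    """Scale-invariant SDR (in dB) between an estimated and a reference signal."""
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    alpha = dot / ref_energy                      # optimal scaling of the reference
    target = [alpha * r for r in ref]             # projection of est onto the reference
    noise = [e - t for e, t in zip(est, target)]  # residual error
    num = sum(t * t for t in target)
    den = sum(n * n for n in noise) + eps
    return 10.0 * math.log10(num / den + eps)

def si_sdri(est, ref, mixture):
    """Improvement over using the raw mixture itself as the estimate."""
    return si_sdr(est, ref) - si_sdr(mixture, ref)
```

A good separated source yields a positive SI-SDRi, since it is closer to the reference than the mixture baseline is.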
Researcher Affiliation | Academia | ¹École polytechnique, ²KTH Royal Institute of Technology. Correspondence to: Yazid Janati, Badr Moufad <firstEMAIL>.
Pseudocode | Yes | Algorithm 1 Gibbs sampler targeting (13) ... Algorithm 2 MIXTURE-GUIDED DIFFUSION MODEL ... Algorithm 3 Gauss_VI routine ... Algorithm 4 Gibbs sampler targeting (13)
Open Source Code | No | Our code will be made available upon acceptance of the paper.
Open Datasets | Yes | We evaluate our method on a diverse set of six linear inverse problems and four nonlinear inverse problems with three different image priors at 256×256 resolution: the pixel-space FFHQ model of Choi et al. (2021), the latent-space FFHQ of Rombach et al. (2022), and the ImageNet model of Dhariwal & Nichol (2021). The evaluation is conducted on the publicly available Slakh2100 test dataset (Manilow et al., 2019) with the scale-invariant SDR improvement (SI-SDRi) metric (Roux et al., 2019).
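The linear inverse problems referenced above share the form y = A x + noise, where A is a known degradation operator. A minimal sketch with a hypothetical inpainting (masking) operator illustrates the setup; the operator and variable names are illustrative, not the paper's code.

```python
import random

def inpainting_operator(mask):
    """Build a linear degradation A that zeroes out masked-away pixels."""
    def apply(x):
        return [xi if keep else 0.0 for xi, keep in zip(x, mask)]
    return apply

random.seed(0)
x = [random.random() for _ in range(8)]                   # clean "image" (flattened)
mask = [True, True, False, True, False, True, True, True]  # False = missing pixel
A = inpainting_operator(mask)
noise_std = 0.05
y = [axi + random.gauss(0.0, noise_std) for axi in A(x)]  # observation y = A x + n
```

A guided diffusion method then samples from the posterior over x given y and the prior defined by the diffusion model.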
Dataset Splits | Yes | The evaluation is done on a subset of 300 validation images per dataset. For FFHQ, we use the first 300 images, while for ImageNet, we randomly sample 300 images to avoid class bias. We report the LPIPS metric (Zhang et al., 2018) in Tables 1 and 2 and defer the complete tables with FID, PSNR and SSIM alongside 95% confidence intervals to Table 6, Table 7, and Table 8. For the phase retrieval task specifically, we draw 4 samples for each algorithm and keep only the best scoring one in terms of LPIPS. A similar strategy is used in (Chung et al., 2023; Zhang et al., 2024; Wu et al., 2024). [...] Tracks from the test dataset are evaluated using a sliding window approach with 4-second chunks and a 2-second overlap.
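The sliding-window evaluation (4-second chunks, 2-second overlap) implies a hop of 2 seconds between window starts. A small sketch of the chunk-boundary computation; the function name and the end-alignment of the final window are assumptions, not details from the paper.

```python
def chunk_starts(total_sec, win_sec=4.0, hop_sec=2.0):
    """Start times of sliding windows covering [0, total_sec] with overlap win - hop."""
    starts = []
    t = 0.0
    while t + win_sec < total_sec:
        starts.append(t)
        t += hop_sec
    # align the final window to the end of the track so no audio is dropped
    starts.append(max(total_sec - win_sec, 0.0))
    return starts
```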
Hardware Specification | Yes | All experiments were conducted on NVIDIA Tesla V100 SXM2 GPUs.
Software Dependencies | No | The denoiser network is based on a non-latent, time-domain unconditional variant of (Schneider et al., 2023). Its architecture follows a U-Net design, comprising an encoder, bottleneck, and decoder. Training is performed on the four stacked instruments using the publicly available trainer from repository2. (footnote 2: https://github.com/archinetai/audio-diffusion-pytorch-trainer) ... In the anonymous codebase provided as a companion to the paper, we use ᾱ_t instead of α_t to match the conventions of existing codebases.
Experiment Setup | Yes | The details about the hyperparameters of MGDM are reported in Table 5. We adjust the optimization of the Gaussian variational approximation in Algorithm 3 during the first and last diffusion steps. We ramp up the number of gradient steps during the final diffusion steps. This allows us to substantially improve the fine-grained details of the reconstructions. Similarly, we reduce the learning rate in the early steps to alleviate potential instabilities. We tune the parameters of our algorithm per dataset and not per task.
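The row above describes ramping up the number of gradient steps near the end of sampling and lowering the learning rate early on. A toy schedule capturing that shape; the thresholds, base values, and function name are illustrative assumptions, not the paper's tuned settings from Table 5.

```python
def vi_schedule(step, n_steps, base_steps=5, base_lr=1e-2):
    """Per-diffusion-step (n_grad_steps, lr) for a Gaussian VI routine.

    Early steps: reduced learning rate to alleviate instabilities.
    Final steps: extra gradient steps to refine fine-grained details.
    """
    frac = step / n_steps                # 0 at the start of sampling, 1 at the end
    if frac < 0.1:                       # early diffusion steps
        return base_steps, base_lr * 0.1
    if frac > 0.9:                       # final diffusion steps: ramp up effort
        ramp = (frac - 0.9) / 0.1
        return base_steps + int(ramp * 3 * base_steps), base_lr
    return base_steps, base_lr
```

Tuning such a schedule per dataset rather than per task, as the paper reports, keeps the number of free hyperparameters small.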