Improved Sampling Algorithms for Lévy-Itô Diffusion Models

Authors: Vadim Popov, Assel Yermekova, Tasnima Sadekova, Artem Khrapov, Mikhail Kudinov

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the benefits of using these SDEs at inference in terms of generated samples quality on image generation task and verify that samples diversity does not suffer if we generate data with the proposed SDEs. We train a Lévy-Itô text-to-speech model on a highly imbalanced dataset and evaluate its performance for speakers with different amount of training data. Section 5 is titled "EXPERIMENTS" and includes tables with metrics such as FID, coverage, and speaker similarity.
Researcher Affiliation | Industry | Vadim Popov, Assel Yermekova, Tasnima Sadekova, Huawei Noah's Ark Lab, EMAIL; Artem Khrapov & Mikhail Kudinov, Huawei Noah's Ark Lab, EMAIL, EMAIL
Pseudocode | No | The paper describes methods and equations verbally and mathematically but does not include any clearly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | No | The paper does not contain an explicit statement about releasing code, a link to a code repository, or a mention of code in supplementary materials for the described methodology.
Open Datasets | Yes | We train 3 Lévy-Itô models with α = 1.8, 1.5 and 1.2 on CIFAR10 with the same architecture as in the mentioned paper... We train text-to-speech models on extremely imbalanced dataset consisting of 16.6 hours (1000 minutes) of an English female speaker (Ito, 2017) and 10 minutes of an English male speaker with id 9017 from Bakhturina et al. (2021).
Dataset Splits | Yes | We train 3 Lévy-Itô models with α = 1.8, 1.5 and 1.2 on CIFAR10... The model we use for CIFAR10 experiments is NCSN++(deep) (Yoon et al., 2023; Song et al., 2021c) with 8 residual blocks... Imbalanced CIFAR10 contained 5000, 2997, 1796, 1077, 645, 387, 232, 139, 83 and 50 images belonging to classes "airplane", "automobile", "bird", "cat", "deer", "dog", "frog", "horse", "ship" and "truck" correspondingly. It is the same setting as that used in Yoon et al. (2023). Figure 4 shows performance of different models and different solvers depending on η... FID on CIFAR10 test set containing 10k images.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, or other accelerators) used for running the experiments.
Software Dependencies | No | The paper mentions several software components and models (NCSN++, Montreal Forced Aligner, HiFi-GAN, CAM++ speaker verification model) but does not provide specific version numbers for these or other software dependencies (e.g., programming languages, libraries, frameworks).
Experiment Setup | Yes | The model we use for CIFAR10 experiments is NCSN++(deep) (Yoon et al., 2023; Song et al., 2021c) with 8 residual blocks. We train 3 models for α = 1.8, 1.5 and 1.2 with batch size 128 and learning rate 0.0001 for 250k iterations. Diffusion models tend to overfit on CIFAR10 so we choose the best checkpoint in terms of FID on the test set (100k, 150k and 180k iterations for α = 1.8, 1.5 and 1.2 respectively).