Improved Sampling Algorithms for Lévy-Itô Diffusion Models
Authors: Vadim Popov, Assel Yermekova, Tasnima Sadekova, Artem Khrapov, Mikhail Kudinov
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the benefits of using these SDEs at inference in terms of generated sample quality on the image generation task and verify that sample diversity does not suffer if we generate data with the proposed SDEs. We train a Lévy-Itô text-to-speech model on a highly imbalanced dataset and evaluate its performance for speakers with different amounts of training data. Section 5 is titled "EXPERIMENTS" and includes tables with metrics such as FID, coverage, and speaker similarity. |
| Researcher Affiliation | Industry | Vadim Popov, Assel Yermekova, Tasnima Sadekova, Huawei Noah's Ark Lab (EMAIL); Artem Khrapov & Mikhail Kudinov, Huawei Noah's Ark Lab (EMAIL, EMAIL) |
| Pseudocode | No | The paper describes methods and equations verbally and mathematically but does not include any clearly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing code, a link to a code repository, or a mention of code in supplementary materials for the described methodology. |
| Open Datasets | Yes | We train 3 Lévy-Itô models with α = 1.8, 1.5 and 1.2 on CIFAR10 with the same architecture as in the mentioned paper... We train text-to-speech models on an extremely imbalanced dataset consisting of 16.6 hours (1000 minutes) of an English female speaker (Ito, 2017) and 10 minutes of an English male speaker with id 9017 from Bakhturina et al. (2021). |
| Dataset Splits | Yes | We train 3 Lévy-Itô models with α = 1.8, 1.5 and 1.2 on CIFAR10... The model we use for CIFAR10 experiments is NCSN++(deep) (Yoon et al., 2023; Song et al., 2021c) with 8 residual blocks... Imbalanced CIFAR10 contained 5000, 2997, 1796, 1077, 645, 387, 232, 139, 83 and 50 images belonging to the classes "airplane", "automobile", "bird", "cat", "deer", "dog", "frog", "horse", "ship" and "truck" respectively. It is the same setting as that used in Yoon et al. (2023). Figure 4 shows performance of different models and different solvers depending on η... FID on CIFAR10 test set containing 10k images. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, or other accelerators) used for running the experiments. |
| Software Dependencies | No | The paper mentions several software components and models (NCSN++, Montreal Forced Aligner, HiFi-GAN, CAM++ speaker verification model) but does not provide specific version numbers for these or other software dependencies (e.g., programming languages, libraries, frameworks). |
| Experiment Setup | Yes | The model we use for CIFAR10 experiments is NCSN++(deep) (Yoon et al., 2023; Song et al., 2021c) with 8 residual blocks. We train 3 models for α = 1.8, 1.5 and 1.2 with batch size 128 and learning rate 0.0001 for 250k iterations. Diffusion models tend to overfit on CIFAR10 so we choose the best checkpoint in terms of FID on the test set (100k, 150k and 180k iterations for α = 1.8, 1.5 and 1.2 respectively). |
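Since the paper releases no code, the imbalanced-CIFAR10 setting quoted above (per-class counts 5000, 2997, ..., 50, following Yoon et al., 2023) can be reproduced with a small subsampling routine. The sketch below is an assumption-laden illustration, not the authors' implementation: it assumes the standard CIFAR10 label encoding 0–9 in the class order listed in the table, and the `imbalanced_subset` helper name is hypothetical.

```python
import numpy as np

# Per-class image counts for the imbalanced CIFAR10 setting quoted in the
# report (class order: airplane, automobile, bird, cat, deer, dog, frog,
# horse, ship, truck).
CLASS_COUNTS = [5000, 2997, 1796, 1077, 645, 387, 232, 139, 83, 50]

def imbalanced_subset(labels, counts, seed=0):
    """Return indices of a random subset with counts[c] samples of class c."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    chosen = []
    for cls, n in enumerate(counts):
        cls_idx = np.flatnonzero(labels == cls)  # all indices of this class
        chosen.append(rng.choice(cls_idx, size=n, replace=False))
    return np.concatenate(chosen)

# Synthetic labels standing in for the CIFAR10 training set, which has
# 5000 images per class; in practice these would come from the dataset loader.
labels = np.repeat(np.arange(10), 5000)
idx = imbalanced_subset(labels, CLASS_COUNTS)
```

The selection is done per class rather than by a global shuffle so that each class hits its target count exactly, matching the exact figures reported in the Dataset Splits row.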