Distillation of Discrete Diffusion through Dimensional Correlations

Authors: Satoshi Hayakawa, Yuhta Takida, Masaaki Imaizumi, Hiromi Wakaki, Yuki Mitsufuji

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results show the effectiveness of the proposed method in distilling pretrained discrete diffusion models across image and language domains. ... (5. Experimental results)
Researcher Affiliation | Collaboration | ¹Sony Group Corporation, Tokyo, Japan; ²Sony AI, Tokyo, Japan; ³The University of Tokyo, Tokyo, Japan.
Pseudocode | No | The paper describes its methods in prose and equations (e.g., Section 3 and Appendix A) but includes no clearly labeled "Pseudocode" or "Algorithm" blocks, nor any steps formatted as an algorithm.
Open Source Code | Yes | The code used in the paper is available at https://github.com/sony/di4c.
Open Datasets | Yes | On CIFAR-10 with a pixel-based discretized Gaussian diffusion... on ImageNet class-conditional generation with masked diffusion... on masked diffusion language modeling with OpenWebText...
Dataset Splits | No | The paper uses well-known datasets (CIFAR-10, ImageNet, OpenWebText) and refers to a "training dataset" and "test data" in its evaluation. For ImageNet, it notes that "120K out of 1.28M ImageNet training images were used" for finetuning and that 50,000 samples (50 images per ImageNet class) were generated for FID/IS evaluation against test data. For OpenWebText, it states "We used 256 samples from the WebText dataset... generated 5 continuations of 50 tokens". However, the paper does not give explicit percentages or counts for train/validation/test splits, nor does it cite predefined splits for reproducibility, beyond naming the datasets and mentioning "training" and "test" data.
Hardware Specification | Yes | The teacher model was trained for 300 epochs with a minibatch size of 512 using eight A100 GPUs... Our finetuning used two A6000 GPUs with a minibatch size of 4 (2 on each GPU).
Software Dependencies | No | The paper mentions a "PyTorch-based implementation" several times and uses the NLTK library for Self-BLEU computation, but it does not provide version numbers for any of these software dependencies.
Experiment Setup | Yes | For the Di4C finetuning, we followed the original setting in terms of the use of the Adam optimizer and the learning rate 2×10⁻⁴, as well as other hyperparameters. ... Each step used a minibatch of 128/L images... (F.1.3) For optimization, we followed the original implementation, i.e., the AdamW optimizer with a learning rate of 10⁻⁵, (β₁, β₂) = (0.9, 0.96), and a weight decay of 10⁻⁵. ... minibatch size of 512 ... minibatch size 4 ... λ-batch size of 32. It was trained for 30K iterations. (F.2.4) We used the Adam optimizer (but with a learning rate of 3×10⁻⁵) with EMA (decay 0.9999) and a constant warm-up (increasing the learning rate linearly for the first 500 iterations and holding it constant after that). For each experiment, one round of Di4C training ran for 100K iterations ... minibatch size was 2 ... λ-batch size was 16. (F.3.2)
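The optimizer settings quoted above can be collected into a small configuration sketch. This is a minimal, hedged illustration in plain Python; the names `warmup_lr`, `adamw_config`, and `ema_decay` are ours for illustration and do not come from the paper or its released code.

```python
# Hedged sketch of the optimizer settings quoted above (F.2.4 / F.3.2).
# Names here are illustrative, not taken from https://github.com/sony/di4c.

def warmup_lr(step: int, base_lr: float = 3e-5, warmup_iters: int = 500) -> float:
    """Constant warm-up from F.3.2: increase the learning rate linearly
    for the first `warmup_iters` iterations, then hold it constant."""
    if step < warmup_iters:
        return base_lr * (step + 1) / warmup_iters
    return base_lr

# AdamW hyperparameters reported in F.2.4 (language-modeling experiments):
adamw_config = dict(lr=1e-5, betas=(0.9, 0.96), weight_decay=1e-5)

# EMA decay reported alongside the F.3.2 Adam setting:
ema_decay = 0.9999
```

In a PyTorch implementation, `warmup_lr` would correspond to wrapping the optimizer in `torch.optim.lr_scheduler.LambdaLR` with `lambda step: min(1.0, (step + 1) / 500)` applied to a base learning rate of 3×10⁻⁵.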