DRAGON: Distributional Rewards Optimize Diffusion Generative Models
Authors: Yatong Bai, Jonah Casebeer, Somayeh Sojoudi, Nicholas J. Bryan
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | For evaluation, we fine-tune an audio-domain text-to-music diffusion model with 20 reward functions, including a custom music aesthetics model, CLAP score, Vendi diversity, and Fréchet audio distance (FAD). We further compare instance-wise (per-song) and full-dataset FAD settings while ablating multiple FAD encoders and reference sets. Over all 20 target rewards, DRAGON achieves an 81.45% average win rate. |
| Researcher Affiliation | Collaboration | Yatong Bai, University of California, Berkeley; Jonah Casebeer, Adobe Research; Somayeh Sojoudi, University of California, Berkeley; Nicholas J. Bryan, Adobe Research |
| Pseudocode | Yes | Please also find pseudocode in Appendix F. |
| Open Source Code | No | The paper provides a link to example generations: https://ml-dragon.github.io/web. This is a demonstration page, not an explicit statement of, or link to, the source code for the methodology described in the paper. |
| Open Datasets | Yes | Our evaluation uses a combination of the captions in an independent ALIM test split (800 pieces), the captions in a non-vocal Song Describer subset (Manco et al., 2023) (585 pieces, abbreviated as SDNV), and the real-world user prompts in the DMA dataset (800 pieces). |
| Dataset Splits | Yes | To verify the performance of the aesthetics predictor and perform ablation studies, we split the DMA dataset into train/validation subsets by an 85/15 ratio. |
| Hardware Specification | Yes | Pre-training of the baseline diffusion model lasted five days across 32 Nvidia A100 GPUs with a total batch size of 256 and a learning rate of 10⁻⁴ with Adam. |
| Software Dependencies | No | The paper mentions software components like 'FLAN-T5-based text encoder' and 'LAION-CLAP checkpoint' and implies the use of Python for pseudocode, but it does not provide specific version numbers for any key software components or libraries used. |
| Experiment Setup | Yes | The diffusion hyperparameter design follows EDM (Karras et al., 2022), with σ_data = 0.5, P_mean = 0.4, P_std = 1.0, σ_max = 80, and σ_min = 0.002. Also following EDM, we apply a logarithmic transformation to the noise levels, followed by sinusoidal embeddings. These processed noise-level embeddings are then combined and integrated into the DiT block through an adaptive layer normalization block. For text conditioning, we concatenate the T5-embedded text tokens with audio tokens at each attention layer. As a result, the audio token query attends to a concatenated sequence of audio and text keys, enabling the model to jointly extract relevant information from both modalities. Pre-training of the baseline diffusion model lasted five days across 32 Nvidia A100 GPUs with a total batch size of 256 and a learning rate of 10⁻⁴ with Adam. All DRAGON fine-tuning is performed on top of the baseline model introduced above on four or eight A100 GPUs with a total batch size of 80, with the baseline model used as the reference model f_ref required by the loss functions, as defined in the A_θ term in (7). We use Adam with a fixed learning rate of 3×10⁻⁶ and a gradient clip of 45 (determined via gradient logging). For the DPO loss (6) and the KTO loss (8), we select β = 5000 following (Wallace et al., 2024) and do not update f_ref. All training runs (pre-training and DRAGON) use a 10% condition dropout to enhance classifier-free guidance (CFG). The online audio demonstrations in DRAGON are generated with the default inference setting (40 diffusion steps with the second-order DPM sampler with CFG++ enabled in selected time steps). We use f_θ to produce the demonstrations with probability 0.9 and use f_ref with probability 0.1. |
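The experiment-setup quote describes EDM-style noise-level handling: training noise levels are drawn log-normally (parameterized by P_mean and P_std), then a logarithmic transform of the noise level is mapped to a sinusoidal embedding before conditioning the DiT. A minimal sketch of these two steps is below; the embedding dimension, frequency scaling, and the ln(σ)/4 conditioning constant are assumptions (the paper states only "a logarithmic transformation followed by sinusoidal embeddings"), not the authors' exact implementation.

```python
import numpy as np


def sample_training_sigma(p_mean=0.4, p_std=1.0, rng=None):
    """Draw one training noise level, EDM convention:
    ln(sigma) ~ Normal(P_mean, P_std^2), using the values quoted above."""
    rng = rng if rng is not None else np.random.default_rng()
    return float(np.exp(rng.normal(p_mean, p_std)))


def noise_level_embedding(sigma, dim=64):
    """Log-transform the noise level, then build a sinusoidal embedding.
    The ln(sigma)/4 scaling and the 10000-base frequency spacing are
    assumed defaults, not taken from the paper."""
    c = np.log(sigma) / 4.0
    half = dim // 2
    # Geometrically spaced frequencies, as in standard sinusoidal embeddings.
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = c * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])


if __name__ == "__main__":
    sigma = sample_training_sigma(rng=np.random.default_rng(0))
    emb = noise_level_embedding(sigma)
    print(sigma, emb.shape)
```

In the paper's pipeline this embedding would then pass through the adaptive layer normalization block of the DiT; that part is omitted here since its architecture is not specified in the quoted text.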