Large Language Models to Diffusion Finetuning

Authors: Edoardo Cetin, Tianyu Zhao, Yujin Tang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we provide descriptions of the implementation specifics, training, and evaluation of our new L2D method. Then, we present comprehensive quantitative results, evaluating the benefits of L2D across state-of-the-art LMs of different sizes from the Llama 3 (Dubey et al., 2024) and Qwen 2.5 (Hui et al., 2024) families. Lastly, we focus on Llama 3.2 1B Instruct to study the properties of L2D in greater depth, showing its complementarity to traditional finetuning and search approaches, and also pushing performance with further advances from the diffusion literature, such as adaptive ODE solvers and classifier-free guidance.
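The classifier-free guidance mentioned above follows the standard diffusion-model recipe: extrapolate from an unconditional prediction toward a conditional one by a guidance scale. The sketch below is illustrative only (not the authors' code); `v_uncond`, `v_cond`, and `w` are hypothetical names for the unconditional output, conditional output, and guidance scale.

```python
# Illustrative sketch of classifier-free guidance (CFG), the standard
# diffusion technique referenced in the text. Vectors are plain lists here;
# in practice these would be per-token model predictions.

def cfg_combine(v_uncond, v_cond, w):
    """Extrapolate from the unconditional prediction toward the
    conditional one: v = v_uncond + w * (v_cond - v_uncond)."""
    return [vu + w * (vc - vu) for vu, vc in zip(v_uncond, v_cond)]

# w = 1 recovers the conditional prediction; w > 1 amplifies conditioning.
print(cfg_combine([0.0, 1.0], [1.0, 3.0], 1.0))  # [1.0, 3.0]
print(cfg_combine([0.0, 1.0], [1.0, 3.0], 2.0))  # [2.0, 5.0]
```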
Researcher Affiliation | Industry | Sakana AI, Tokyo, Japan. Correspondence to: Edoardo Cetin <EMAIL>, Tianyu Zhao <EMAIL>, Yujin Tang <EMAIL>.
Pseudocode | Yes | Algorithm 1: Diffusion language modeling predictions
Open Source Code | Yes | We provide our full code (https://github.com/SakanaAI/L2D) to facilitate future advances in developing new scalable foundation models with diffusion.
Open Datasets | Yes | We evaluate L2D on challenging generation tasks broadly focused on math, coding, and general knowledge in a 5-shot setting... We consider the following tasks: GSM8K (Cobbe et al., 2021) and competition MATH (Hendrycks et al., 2021b) to evaluate mathematical reasoning; HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021b) for coding skills; together with MMLU (Hendrycks et al., 2021a) and MMLU-Pro (Wang et al., 2024) to assess knowledge retention.
Dataset Splits | Yes | In total, the produced training and validation datasets contain 892,283 and 46,848 examples, respectively. We evaluate L2D on challenging generation tasks broadly focused on math, coding, and general knowledge in a 5-shot setting. We choose to keep our evaluation consistent across all our tasks, without task-specific system prompts, sampling parameters, or involved answer extractions.
Hardware Specification | No | The paper mentions "multi-node settings" for RL training in Appendix D.7 for comparing with other methods, but does not specify the hardware used for the described experiments.
Software Dependencies | No | Lastly, we want to acknowledge the torchdiffeq (Chen, 2018) library, which we use in our implementation to compute the diffusion path with L2D. However, no specific version number is provided for torchdiffeq or any other software component.
Experiment Setup | Yes | We employ σ = 64 for the standard deviation of the base distribution p0, as the discrete nature of language makes token classification trivial for low noise levels, and we want to regularize against the model's most influential diffusion steps being concentrated early on during inference. In all main results, we perform multi-step inference with a midpoint solver and 8 discretization levels, resulting in only 15 evaluations of fθ^d. Table 3. Implementation hyper-parameters of the weight finetuning baselines and L2D. (Optimizer: AdamW; Warmup steps: 100; Maximum learning rate: 1e-5 / 1e-4; Final learning rate: 1e-6; Decay: Linear; LoRA alpha: 64 / 32; Batch size: 32; Training epochs: 1; Maximum sequence length: 2048; Timestep training sampling t: Uniform; ODE solver: Midpoint; Total diffusion budget T: 15)
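The multi-step inference described above relies on the explicit midpoint ODE solver, which costs two function evaluations per integration step. The following is a minimal stand-alone sketch of such a solver on a toy scalar ODE, with a call counter, to make the evaluation budget concrete; it is not the authors' implementation (which uses torchdiffeq over the learned velocity field), and the exact accounting behind the reported 15 evaluations depends on implementation details not given here.

```python
import math

# Minimal sketch: fixed-step explicit midpoint solver of the kind used for
# L2D's multi-step diffusion inference, with a counter tracking how many
# times the velocity field f is evaluated.

def midpoint_solve(f, y0, t0, t1, n_steps):
    """Integrate dy/dt = f(t, y) from t0 to t1 with the explicit midpoint method."""
    h = (t1 - t0) / n_steps
    y, t = y0, t0
    for _ in range(n_steps):
        k1 = f(t, y)                       # slope at the interval start
        k2 = f(t + h / 2, y + h / 2 * k1)  # slope at the interval midpoint
        y = y + h * k2                     # advance using the midpoint slope
        t += h
    return y

# Toy velocity field dy/dt = -y, whose exact solution at t = 1 is exp(-1).
calls = {"n": 0}
def f(t, y):
    calls["n"] += 1
    return -y

y1 = midpoint_solve(f, 1.0, 0.0, 1.0, 8)
print(y1, calls["n"])  # ~exp(-1); 2 evaluations per step -> 16 calls for 8 steps
```

With a learned velocity field standing in for `f`, the same fixed-step loop reproduces the low-budget multi-step inference regime described in the row above.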