Large Language Models to Diffusion Finetuning

Authors: Edoardo Cetin, Tianyu Zhao, Yujin Tang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we provide descriptions of the implementation specifics, training, and evaluation of our new L2D method. Then, we present comprehensive quantitative results, evaluating the benefits of L2D across state-of-the-art LMs of different sizes from the Llama 3 (Dubey et al., 2024) and Qwen 2.5 (Hui et al., 2024) families. Lastly, we focus on Llama 3.2 1B Instruct to study the properties of L2D in greater depth, showing its complementarity to traditional finetuning and search approaches, and also pushing performance with further advances from the diffusion literature, such as adaptive ODE solvers and classifier-free guidance.
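The classifier-free guidance mentioned above follows the standard diffusion-model recipe: extrapolate from an unconditional prediction toward a conditional one by a guidance scale. The sketch below is illustrative only (not the authors' code); `v_uncond`, `v_cond`, and `w` are hypothetical names for the unconditional output, conditional output, and guidance scale.

```python
# Illustrative sketch of classifier-free guidance (CFG), the standard
# diffusion technique referenced in the text. Vectors are plain lists here;
# in practice these would be per-token model predictions.

def cfg_combine(v_uncond, v_cond, w):
    """Extrapolate from the unconditional prediction toward the
    conditional one: v = v_uncond + w * (v_cond - v_uncond)."""
    return [vu + w * (vc - vu) for vu, vc in zip(v_uncond, v_cond)]

# w = 1 recovers the conditional prediction; w > 1 amplifies conditioning.
print(cfg_combine([0.0, 1.0], [1.0, 3.0], 1.0))  # [1.0, 3.0]
print(cfg_combine([0.0, 1.0], [1.0, 3.0], 2.0))  # [2.0, 5.0]
```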
Researcher Affiliation | Industry | Sakana AI, Tokyo, Japan. Correspondence to: Edoardo Cetin <EMAIL>, Tianyu Zhao <EMAIL>, Yujin Tang <EMAIL>.
Pseudocode | Yes | Algorithm 1: Diffusion language modeling predictions
Open Source Code | Yes | We provide our full code (https://github.com/SakanaAI/L2D) to facilitate future advances in developing new scalable foundation models with diffusion.
Open Datasets | Yes | We evaluate L2D on challenging generation tasks broadly focused on math, coding, and general knowledge in a 5-shot setting... We consider the following tasks: GSM8K (Cobbe et al., 2021) and competition MATH (Hendrycks et al., 2021b) to evaluate mathematical reasoning; HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021b) for coding skills; together with MMLU (Hendrycks et al., 2021a) and MMLU-Pro (Wang et al., 2024) to assess knowledge retention.
Dataset Splits | Yes | In total, the produced training and validation datasets contain 892,283 and 46,848 examples, respectively. We evaluate L2D on challenging generation tasks broadly focused on math, coding, and general knowledge in a 5-shot setting. We choose to keep our evaluation consistent across all our tasks, without task-specific system prompts, sampling parameters, or involved answer extractions.
Hardware Specification | No | The paper mentions "multi-node settings" for RL training in Appendix D.7 for comparing with other methods, but does not specify the hardware used for the described experiments.
Software Dependencies | No | Lastly, we want to acknowledge the torchdiffeq (Chen, 2018) library, which we use in our implementation to compute the diffusion path with L2D. However, no specific version number is provided for torchdiffeq or any other software component.
Experiment Setup | Yes | We employ σ = 64 for the standard deviation of the base distribution p0, as the discrete nature of language makes token classification trivial for low noise levels, and we want to regularize against the model's most influential diffusion steps being concentrated early on during inference. In all main results, we perform multi-step inference with a midpoint solver and 8 discretization levels, resulting in only 15 evaluations of fθ^d. Table 3. Implementation hyper-parameters of the weight finetuning baselines and L2D. (Optimizer: AdamW; Warmup steps: 100; Maximum learning rate: 1e-5 / 1e-4; Final learning rate: 1e-6; Decay: Linear; LoRA alpha: 64 / 32; Batch size: 32; Training epochs: 1; Maximum sequence length: 2048; Timestep training sampling t: Uniform; ODE solver: Midpoint; Total diffusion budget T: 15)
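The multi-step inference described above relies on the explicit midpoint ODE solver, which costs two function evaluations per integration step. The following is a minimal stand-alone sketch of such a solver on a toy scalar ODE, with a call counter, to make the evaluation budget concrete; it is not the authors' implementation (which uses torchdiffeq over the learned velocity field), and the exact accounting behind the reported 15 evaluations depends on implementation details not given here.

```python
import math

# Minimal sketch: fixed-step explicit midpoint solver of the kind used for
# L2D's multi-step diffusion inference, with a counter tracking how many
# times the velocity field f is evaluated.

def midpoint_solve(f, y0, t0, t1, n_steps):
    """Integrate dy/dt = f(t, y) from t0 to t1 with the explicit midpoint method."""
    h = (t1 - t0) / n_steps
    y, t = y0, t0
    for _ in range(n_steps):
        k1 = f(t, y)                       # slope at the interval start
        k2 = f(t + h / 2, y + h / 2 * k1)  # slope at the interval midpoint
        y = y + h * k2                     # advance using the midpoint slope
        t += h
    return y

# Toy velocity field dy/dt = -y, whose exact solution at t = 1 is exp(-1).
calls = {"n": 0}
def f(t, y):
    calls["n"] += 1
    return -y

y1 = midpoint_solve(f, 1.0, 0.0, 1.0, 8)
print(y1, calls["n"])  # ~exp(-1); 2 evaluations per step -> 16 calls for 8 steps
```

With a learned velocity field standing in for `f`, the same fixed-step loop reproduces the low-budget multi-step inference regime described in the row above.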