Distributional Diffusion Models with Scoring Rules
Authors: Valentin De Bortoli, Alexandre Galashov, J Swaroop Guntupalli, Guangyao Zhou, Kevin Patrick Murphy, Arthur Gretton, Arnaud Doucet
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments in Section 6 demonstrate that our approach produces high-quality samples with significantly fewer denoising steps. We observe substantial benefits across a range of tasks, including image-generation tasks in both pixel and latent spaces, and in robotics applications. |
| Researcher Affiliation | Collaboration | 1Google DeepMind, 2Gatsby Unit, UCL. Correspondence to: Valentin De Bortoli <EMAIL>, Alexandre Galashov <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Distributional Diffusion Model (training). Require: M training steps, schedule (αt, σt), distribution p0, weights θ0, batch size n, population size m. for k = 1 : M do: Sample t_i ∼ Unif([0, 1]) for i ∈ [n]; Sample X_0^i i.i.d. ∼ p0 for i ∈ [n]; Sample X_{t_i}^i ∼ p_{t_i|0}(·|X_0^i) for i ∈ [n] using (2); Sample ξ^{i,j} ∼ N(0, I_d) for i ∈ [n], j ∈ [m]; Set θ_k = θ_{k−1} − δ ∇L_{n,m}(θ)|_{θ_{k−1}} using (14); end for. Return: θ_M |
| Open Source Code | No | The paper does not provide explicit statements about releasing source code for the methodology, nor does it include links to a code repository. Appendix G contains pseudocode, but this is not equivalent to open-source code. |
| Open Datasets | Yes | We train conditional pixel-space models on CIFAR-10 (32x32x3) and on CelebA (64x64x3), as well as unconditional pixel-space models on LSUN Bedrooms (64x64x3). We further use an autoencoder trained on CelebA-HQ (256x256x3)... We experiment with diffusion policies on the Libero (Liu et al., 2024) benchmark... We ran initial experiments on ImageNet (Russakovsky et al., 2015) with resolution 64x64x3 |
| Dataset Splits | No | The paper mentions that 4096 samples were used for evaluation in the 2D experiments, and that final FID was computed on 50000 samples for CIFAR-10, LSUN Bedrooms, and CelebA, and on 30000 samples for latent CelebA-HQ. For robotics, Libero10 had 138090 steps of training data. However, it does not explicitly provide percentages or specific quantities for training, validation, and test splits for any of these datasets, nor does it cite a methodology for such splits. |
| Hardware Specification | Yes | As hardware, we use an A100 GPU (40 GB of memory) with batch size 16, an H100 GPU (80 GB of memory) with batch size 64, and a TPUv5p (95 GB of memory) with batch size 64 (per device, with 4 devices in total). |
| Software Dependencies | No | The paper mentions using 'Adam optimizer', 'cosine warmup', 'JAX', 'BERT encoder', and 'ResNet' but does not provide specific version numbers for any of these software components or libraries. |
| Experiment Setup | Yes | We train diffusion models and distributional diffusion models for 100k steps with batch size 128 and learning rate 1e-3 using the Adam optimizer with a cosine warmup for the first 100 iterations. We use b1 = 0.9, b2 = 0.999, ϵ = 1e-8 in the Adam optimizer. On top of that, we clip the updates by their global norm (with the maximal norm being 1). We use an EMA decay of 0.99. We use the flow matching noise schedule (3) and a safety epsilon of 1e-2... For distributional models, we additionally sweep over λ ∈ {0, 0.1, 0.5, 1.0}, β ∈ {0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 1.5, 1.99, 2.0}. We use population size m = 4. |
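The (λ, β) sweep in the setup row points at a generalized energy-score objective over the m = 4 model samples, which is the scoring-rule loss the paper's title refers to. A minimal NumPy sketch under that assumption (the exact loss form, function name, and parameterization are inferred from the sweep, not taken from the paper's code):

```python
import numpy as np

def energy_score_loss(x0, samples, beta=1.0, lam=1.0):
    """Empirical generalized energy score for one data point (sketch).

    x0:      (d,) clean target.
    samples: (m, d) model samples, one per latent noise draw xi_j.
    The first term pulls samples toward x0; the lam-weighted pairwise
    term pushes samples apart, rewarding distributional spread.
    """
    m = samples.shape[0]
    # (1/m) * sum_j ||x0 - G_j||^beta  -- attraction to the target
    attract = np.mean(np.linalg.norm(samples - x0, axis=-1) ** beta)
    # (1/(2 m (m-1))) * sum_{j != k} ||G_j - G_k||^beta  -- repulsion
    diffs = np.linalg.norm(samples[:, None, :] - samples[None, :, :], axis=-1)
    repel = (diffs ** beta).sum() / (2 * m * (m - 1))
    return attract - lam * repel
```

With λ = 0 the repulsion term vanishes and the loss reduces to plain β-norm regression toward x0; λ > 0 rewards diversity among the m samples drawn in a single denoising pass.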
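The optimizer details in the setup row (Adam with a 100-step cosine warmup, update clipping by global norm with maximum 1, EMA decay 0.99) can be sketched framework-agnostically. The half-cosine ramp shape is an assumption, since the paper only says "cosine warmup":

```python
import math

def lr_schedule(step, base_lr=1e-3, warmup=100):
    # Half-cosine ramp from 0 to base_lr over the first `warmup` steps
    # (assumed shape), then constant base_lr thereafter.
    if step < warmup:
        return base_lr * 0.5 * (1.0 - math.cos(math.pi * step / warmup))
    return base_lr

def clip_by_global_norm(grads, max_norm=1.0):
    # Rescale the whole gradient list so its joint L2 norm is <= max_norm.
    total = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = min(1.0, max_norm / max(total, 1e-12))
    return [[g * scale for g in grad] for grad in grads]

def ema_update(ema_params, params, decay=0.99):
    # Exponential moving average of parameters, typically used at eval time.
    return [[decay * e + (1.0 - decay) * p for e, p in zip(er, pr)]
            for er, pr in zip(ema_params, params)]
```

In the paper's JAX setting these three pieces would normally come from an optimizer library rather than be hand-rolled; the sketch only makes the stated hyperparameters concrete.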