Exploring validation metrics for offline model-based optimisation with diffusion models
Authors: Christopher Beckham, Alexandre Piché, David Vazquez, Christopher Pal
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Exploring validation metrics for offline model-based optimisation with diffusion models ... This involves proposing validation metrics and quantifying them over many datasets for which the ground truth is known, for instance simulated environments. ... we specifically evaluate denoising diffusion models ... we run a large scale study over different hyperparameters and rank these validation metrics by their correlation with the ground truth. (Section 4: Experiments and Discussion) |
| Researcher Affiliation | Collaboration | Christopher Beckham (Mila, Polytechnique Montréal, ServiceNow Research); Alexandre Piché (Mila, Université de Montréal, ServiceNow Research); David Vazquez (ServiceNow Research); Christopher Pal (Mila, Polytechnique Montréal, ServiceNow Research, CIFAR AI Chair) |
| Pseudocode | Yes | Algorithm 1: Training algorithm, with implicit early stopping. Algorithm 2: Final evaluation algorithm. |
| Open Source Code | Yes | 1Corresponding code can be found here: https://github.com/christopher-beckham/validation-metrics-offline-mbo |
| Open Datasets | Yes | We explore five validation metrics in our work against four datasets in the Design Bench (Trabucco et al., 2022) framework, motivating their use as well as describing their advantages and disadvantages. Dataset Our codebase is built on top of the Design Bench (Trabucco et al., 2022) framework. We consider all continuous datasets in Design Bench datasets: Ant Morphology, D Kitty Morphology, Superconductor, and Hopper. |
| Dataset Splits | Yes | Table 2: Summary of datasets used in this work. ... One nuance with the Hopper dataset is that the full dataset D and the training set Dtrain are equivalent... To address this, we compute the median y with respect to Dtrain, and take the lower half as Dtrain and the upper half as Dvalid. ... Note that if a ground truth oracle exists, there is no need to define a Dtest, and this is the case for all datasets except Superconductor (Figure 3b). Otherwise, for Superconductor a random 50% subsample of (Dtrain \ D) is assigned to Dvalid (Figure 3c)... |
| Hardware Specification | Yes | Experiments are trained for 5000 epochs, each on a single P100 GPU. |
| Software Dependencies | No | The paper mentions the ADAM optimiser (Kingma & Ba, 2014) and Hugging Face's annotated diffusion model, but it does not specify version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | For all experiments we train with the ADAM optimiser (Kingma & Ba, 2014), with a learning rate of 2 × 10⁻⁵, β = (0.0, 0.9), and diffusion timesteps T = 200. Experiments are trained for 5000 epochs... Here we list hyperparameters that differ between experiments: diffusion_kwargs.tau... gen_kwargs.dim... diffusion_kwargs.w_cg... Hyperparameters explored for classifier-free guidance: { diffusion_kwargs.tau: {0.05, 0.1, 0.2, 0.4, 0.5}, gen_kwargs.dim: {128, 256} }. Hyperparameters explored for classifier guidance: { diffusion_kwargs.w_cg: {1.0, 10.0, 100.0}, epochs: {5000, 10000}, gen_kwargs.dim: {128, 256} } |
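The Hopper split described in the Dataset Splits row can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the full dataset D equals Dtrain, so the scores y are split at their median, with the lower half forming Dtrain and the upper half Dvalid. All array names here are hypothetical.

```python
import numpy as np

def median_split(xs, ys):
    """Split (xs, ys) at the median of ys: lower half -> train, upper half -> valid."""
    med = np.median(ys)
    lower = ys <= med
    return (xs[lower], ys[lower]), (xs[~lower], ys[~lower])

# Hypothetical stand-in data for the Hopper designs and their scores.
rng = np.random.default_rng(0)
xs = rng.normal(size=(100, 8))   # design inputs (shape is illustrative)
ys = rng.normal(size=100)        # ground-truth scores
(train_x, train_y), (valid_x, valid_y) = median_split(xs, ys)
```

Under this split every validation score exceeds every training score, which matches the paper's intent of validating on designs better than those seen in training.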
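The hyperparameter sweeps listed in the Experiment Setup row can be enumerated as a Cartesian product. The dotted key names mirror the paper's config; the `grid` helper itself is a hypothetical sketch, not part of the released codebase.

```python
from itertools import product

# Grids quoted from the paper's experiment setup.
classifier_free_guidance = {
    "diffusion_kwargs.tau": [0.05, 0.1, 0.2, 0.4, 0.5],
    "gen_kwargs.dim": [128, 256],
}
classifier_guidance = {
    "diffusion_kwargs.w_cg": [1.0, 10.0, 100.0],
    "epochs": [5000, 10000],
    "gen_kwargs.dim": [128, 256],
}

def grid(space):
    """Expand a dict of lists into a list of per-run config dicts."""
    keys = list(space)
    return [dict(zip(keys, vals)) for vals in product(*space.values())]

runs = grid(classifier_free_guidance) + grid(classifier_guidance)
# 5*2 classifier-free configs + 3*2*2 classifier-guided configs = 22 runs
```

This makes the scale of the study concrete: 22 hyperparameter configurations per dataset before any seed replication.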