A First-order Generative Bilevel Optimization Framework for Diffusion Models
Authors: Quan Xiao, Hui Yuan, A F M Saif, Gaowen Liu, Ramana Rao Kompella, Mengdi Wang, Tianyi Chen
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present the experimental results of the proposed bilevel-diffusion algorithms in two applications: reward fine-tuning and noise scheduling for diffusion models, and compare them with baseline hyperparameter optimization methods: grid search, random search, and Bayesian search (Snoek et al., 2012). Table 1 presents the average FID, CLIP score, and execution time for each method over prompts. |
| Researcher Affiliation | Collaboration | 1Department of Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute, Troy, NY 2Department of Electrical and Computer Engineering, Cornell Tech, Cornell University, New York, NY 3Department of Electrical and Computer Engineering, Princeton University, NJ 4Cisco Research. |
| Pseudocode | Yes | Algorithm 1: A meta generative bilevel algorithm; Algorithm 2: Bilevel approach with pre-trained model; Algorithm 3: Score network training; Algorithm 4: Backward sampling; Algorithm 5: Guided Diffusion for Generative Optimization; Algorithm 6: Bilevel Approach without Pre-trained Diffusion Model |
| Open Source Code | Yes | Experiments demonstrate that our method outperforms existing finetuning and hyperparameter search baselines. Our code has been released at https://github.com/afmsaif/bilevel_diffusion. |
| Open Datasets | Yes | We evaluated our bilevel noise scheduling method, detailed in Algorithm 6, paired with DDIM backward sampling for image generation on the MNIST dataset. For this experiment, we use the Stable Diffusion V1.5 model as our pre-trained model and employ a ResNet-18 architecture (trained on the ImageNet dataset) as the synthetic (lower-level) reward model. |
| Dataset Splits | No | The paper mentions using the MNIST dataset and generated images for evaluation, but it does not specify any explicit training/test/validation splits (e.g., percentages or sample counts) for these datasets within the text. It implies the use of standard datasets but does not detail their partitioning. |
| Hardware Specification | Yes | All experiments were conducted on two servers: one with four NVIDIA A6000 GPUs and 256 GB of RAM; one with an Intel i9-9960X CPU and two NVIDIA A5000 GPUs. |
| Software Dependencies | Yes | Although it is possible to obtain the gradient of L_SQ(θ, q) with respect to θ using PyTorch's auto-differentiation, it requires differentiating through the backward sampling trajectory. M. pytorch-fid: FID Score for PyTorch. https://github.com/mseitzer/pytorch-fid, August 2020. Version 0.3.0. |
| Experiment Setup | Yes | We use a batch size of 3 for the fine-tuning step, set the number of optimization steps to 7, and repeat the optimization 4 times. We use a batch size of 128 and choose the number of inner loops S_z for θ_z updates as 1. Empirically, we found that, at the beginning of the training process (i.e., when k = 0), the number of inner loops S_y^0 for updating θ_y should be larger to obtain a relatively reasonable U-Net, but later on we do not need large inner loops, i.e., we set S_y^k = 10 for k ≥ 1. We formalize this stage as the initial epoch, where we traverse every batch and set S_y^0 = 20. We choose the ZO perturbation amount as ν = 0.01. |
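The setup above mentions a zeroth-order (ZO) perturbation amount ν = 0.01, which avoids differentiating through the backward sampling trajectory. As a minimal sketch of what such an estimator could look like — the function name `zo_gradient`, the two-point Gaussian-smoothing form, and the `loss_fn` interface are all assumptions, not the paper's actual implementation:

```python
import torch


def zo_gradient(loss_fn, theta, nu=0.01, n_samples=1):
    """Two-point zeroth-order estimate of the gradient of loss_fn at theta.

    Illustrative sketch only: the paper reports nu = 0.01, but the exact
    estimator it uses is not shown; this is a standard Gaussian-smoothing
    central-difference form.
    """
    grad = torch.zeros_like(theta)
    for _ in range(n_samples):
        u = torch.randn_like(theta)  # random Gaussian search direction
        # central difference along u, scaled back onto the direction
        delta = loss_fn(theta + nu * u) - loss_fn(theta - nu * u)
        grad += (delta / (2 * nu)) * u
    return grad / n_samples
```

Because each sample only needs two loss evaluations, this kind of estimator sidesteps backpropagation through the sampling chain entirely, at the cost of estimator variance that shrinks as `n_samples` grows.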