Stochastic Control for Fine-tuning Diffusion Models: Optimality, Regularity, and Convergence

Authors: Yinbin Han, Meisam Razaviyayn, Renyuan Xu

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In this section, we evaluate the performance of the PI-FT algorithm from Section 3 via numerical experiments, focusing on the following questions: In practice, how fast does the PI-FT algorithm converge to the optimal solution? How does the choice of β affect the convergence rate and the quality of the fine-tuned models? As shown in this section, the PI-FT algorithm converges efficiently to the global optimum; increasing β accelerates convergence and yields a model closer to the pre-trained one, aligning with our theoretical analysis in Section 3. Model Setup. We fine-tune Stable Diffusion v1.5 (Rombach et al., 2022) for text-to-image generation, using LoRA (Hu et al., 2022) and ImageReward (Xu et al., 2023). Following Fan et al. (2024), we use four prompts, "A green colored rabbit," "A cat and a dog," "Four wolves in the park," and "A dog on the moon," to evaluate the model's ability to generate correct color, composition, counting, and location, respectively. During training, we generate 10 trajectories, each consisting of 50 transitions, to calculate the gradient, with 1000 gradient steps. By default, we use the AdamW optimizer with a learning rate of 3 × 10⁻⁴ and set the KL regularization coefficient to a fixed value of β = 0.01.
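The excerpt above reports that a larger KL coefficient β keeps the fine-tuned model closer to the pre-trained one. A minimal numeric illustration of why, assuming a control update of the form u = s_pre + (α σ² / ((1 − α) β)) ∇E[V] as in the paper's Algorithm 1: the deviation from the pre-trained score scales like 1/β. The constants below are illustrative stand-ins, not values from the paper.

```python
# Hedged illustration: deviation of the fine-tuned control from the
# pre-trained score shrinks as the KL coefficient beta grows.
alpha, sigma, grad = 0.9, 0.5, 1.0   # illustrative values only
deviations = []
for beta in (0.01, 0.1, 1.0):
    # |u - s_pre| = alpha * sigma^2 / ((1 - alpha) * beta) * |grad E[V]|
    deviation = alpha * sigma**2 / ((1 - alpha) * beta) * abs(grad)
    deviations.append(deviation)
    print(f"beta={beta:.2f}: |u - s_pre| scale = {deviation:.2f}")
```

Increasing β by a factor of 10 shrinks the deviation by the same factor, consistent with the reported behavior.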
Researcher Affiliation Academia ¹Department of Finance and Risk Engineering, New York University; ²Daniel J. Epstein Department of Industrial and Systems Engineering, University of Southern California. Correspondence to: Renyuan Xu <EMAIL>.
Pseudocode Yes Algorithm 1 Policy Iteration for Fine-Tuning (PI-FT)
1: Input: expected reward function r(·), pre-trained model {s_t^pre}_{t=0}^{T}, and numbers of iterations {m_t}_{t=0}^{T−1}.
2: Set V_T^{(m_T)}(y) = r(y) for all y ∈ R^d.
3: for t = T − 1, …, 0 do
4:   Set u_t^{(0)}(y) = s_t^pre(y).
5:   for m = 0, 1, …, m_t − 1 do
6:     Update the control using
         u_t^{(m+1)}(y) = (α_t σ_t²) / ((1 − α_t) β_t) · ∇ E[V_{t+1}^{(m_{t+1})}(y^{(m)})] + s_t^pre(y),   (18)
       where y^{(m)} = (1/√α_t)(y + (1 − α_t) u_t^{(m)}(y)) + σ_t W_t.
7:   end for
8:   Compute the value function V_t^{(m_t)} using
         V_t^{(m_t)}(y) = E[V_{t+1}^{(m_{t+1})}(y^{(m_t)})] + β_t (1 − α_t)² / (2 α_t σ_t²) · ‖u_t^{(m_t)}(y) − s_t^pre(y)‖₂².
9: end for
10: return {u_t^{(m_t)}}_{t=0}^{T−1} and {V_t^{(m_t)}}_{t=0}^{T}.
Open Source Code No The paper does not contain any explicit statements about releasing source code for the described methodology, nor does it provide a direct link to a code repository. Citations to third-party tools or general project overviews are not sufficient.
Open Datasets No The paper mentions fine-tuning Stable Diffusion v1.5 and evaluating with ImageReward. While these are well-known models/metrics, the paper describes using "a small sample set with human feedback" for fine-tuning without providing concrete access information (a link, DOI, or specific citation with author/year for the dataset itself) for the particular dataset used in the experiments. Therefore, the dataset used for fine-tuning in these specific experiments is not made publicly accessible.
Dataset Splits No The paper states, "During training, we generate 10 trajectories, each consisting of 50 transitions, to calculate the gradient with 1000 gradient steps." This describes the generation process and training steps rather than a specific division of a pre-existing dataset into training, validation, or testing splits.
Hardware Specification No The paper does not provide any specific hardware details such as GPU models, CPU types, or cloud computing instance specifications used for running the experiments. It only mentions the software artifacts involved (e.g., fine-tuning Stable Diffusion v1.5), not the hardware they ran on.
Software Dependencies No The paper mentions using the "AdamW optimizer" and references "Stable Diffusion v1.5" and "LoRA", but it does not specify any software libraries (e.g., PyTorch, TensorFlow) or their version numbers that would be required to reproduce the experiments. The optimizer name alone is not sufficient.
Experiment Setup Yes During training, we generate 10 trajectories, each consisting of 50 transitions, to calculate the gradient, with 1000 gradient steps. By default, we use the AdamW optimizer with a learning rate of 3 × 10⁻⁴ and set the KL regularization coefficient to a fixed value of β = 0.01. For a fair comparison, we configure DPOK to perform 10 gradient steps per sampling step, using a learning rate of 1 × 10⁻⁵. Each gradient step is computed using 50 randomly sampled transitions from a replay buffer.
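The KL-regularized objective implied by the setup above (10 trajectories × 50 transitions, β = 0.01) can be sketched as follows. This is a hedged toy computation, not the paper's code: the Gaussian-policy means, the shared variance, and the rewards are random stand-ins, and the equal-variance Gaussian KL formula ‖μ_ft − μ_pre‖² / (2σ²) is an illustrative assumption.

```python
import numpy as np

# Hedged sketch of a per-trajectory KL-regularized objective:
# mean terminal reward minus beta times the summed per-step KL penalty.
rng = np.random.default_rng(1)
n_traj, n_steps, beta = 10, 50, 0.01

# Stand-ins for per-step means of the fine-tuned and pre-trained policies.
mu_ft = rng.normal(size=(n_traj, n_steps))
mu_pre = mu_ft + 0.1 * rng.normal(size=(n_traj, n_steps))
sigma2 = 0.25                                  # shared per-step variance

reward = rng.normal(size=n_traj)               # toy terminal reward per trajectory

# KL between two Gaussians with equal variance: ||mu_ft - mu_pre||^2 / (2 sigma^2)
kl_per_step = (mu_ft - mu_pre) ** 2 / (2.0 * sigma2)
objective = reward.mean() - beta * kl_per_step.sum(axis=1).mean()
```

Since the KL penalty is non-negative, the regularized objective is bounded above by the mean reward, and β trades reward maximization against staying close to the pre-trained policy.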