Online Reward-Weighted Fine-Tuning of Flow Matching with Wasserstein Regularization
Authors: Jiajun Fan, Shuaike Shen, Chaoran Cheng, Yuxin Chen, Chumeng Liang, Ge Liu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on tasks including target image generation, image compression, and text-image alignment demonstrate the effectiveness of our method, which achieves optimal policy convergence while allowing controllable trade-offs between reward maximization and diversity preservation. |
| Researcher Affiliation | Academia | Jiajun Fan (1), Shuaike Shen (2), Chaoran Cheng (1), Yuxin Chen (3), Chumeng Liang (4), Ge Liu (1) — 1: University of Illinois Urbana-Champaign, 2: Zhejiang University, 3: Tsinghua University, 4: University of Southern California. |
| Pseudocode | Yes | More experimental details can be found in App. D and E with our Pseudocode in App. H. |
| Open Source Code | No | The paper does not provide an explicit statement or link for the open-sourcing of *their own* implementation code. It states that their code is *built upon* existing open-source codebases like Torch CFM and Diffusers, but does not confirm their specific modifications or additions are publicly available. |
| Open Datasets | Yes | At first, we evaluate our ORW-CFM method, with and without W2 regularization, by fine-tuning a model pre-trained on MNIST (LeCun et al., 1998) to generate only even numbers using reward signals. For the image compression task, we followed the reward function from DDPO (Black et al., 2024), where the goal was to minimize the file size of the images after JPEG compression. The reward r(x) is proportional to the compression rate. We controlled the balance between compression and diversity using the regularization parameter α, which induces varying degrees of divergence between the fine-tuned model and the reference model. A lower α prioritizes compressibility, leading the model to generate images that occupy minimal storage space after compression, while a higher α increases diversity in the generated outputs. |
| Dataset Splits | No | The paper mentions using MNIST and CIFAR-10 datasets and that they utilized existing unconditional flow matching training baselines from Torch CFM. However, it does not explicitly state the specific training, validation, or test splits (e.g., percentages or exact counts) used for these datasets in the main text or appendices. |
| Hardware Specification | Yes | All experiments were conducted on a system with one worker equipped with an 8-core CPU, an NVIDIA A10 GPU, and 32 GB of memory. |
| Software Dependencies | No | The paper mentions several software components like Torch CFM, Diffusers, U-Net, and LoRA, and also specifies fp16 precision. However, it does not provide specific version numbers for these software dependencies, which are required for a reproducible description. |
| Experiment Setup | Yes | We have conducted detailed ablation experiments in the main text and appendix on the two most important hyperparameters in our method: the entropy control parameter τ and the W2 distance regularization parameter α. For all experiments, we use a batch size of 64 and employ a separate test process to evaluate our performance. |
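The compressibility reward quoted above (r(x) proportional to the compression rate) and the entropy control parameter τ can be sketched as follows. This is an illustrative proxy only, not the authors' implementation: `zlib` stands in for JPEG compression so the snippet stays self-contained, and the exponential weighting `exp(r / tau)` is one common reward-weighting form assumed here, not confirmed by the quoted text.

```python
import math
import zlib


def compressibility_reward(image_bytes: bytes) -> float:
    """Reward proportional to the compression rate: the smaller the
    compressed size relative to the original, the higher the reward.
    zlib is used here as a stdlib stand-in for JPEG compression."""
    compressed = zlib.compress(image_bytes, level=9)
    return 1.0 - len(compressed) / len(image_bytes)


def reward_weight(r: float, tau: float = 0.1) -> float:
    """Exponential reward weighting exp(r / tau); a smaller tau makes the
    weighting more peaked on high-reward samples (an assumed form of the
    entropy control parameter described in the table)."""
    return math.exp(r / tau)
```

For example, a highly redundant image buffer (all zeros) compresses well and earns a reward near 1, while a zero reward maps to a neutral weight of 1 regardless of τ.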