Multi-Reward as Condition for Instruction-based Image Editing

Authors: Xin Gu, Ming Li, Libo Zhang, Fan Chen, Longyin Wen, Tiejian Luo, Sijie Zhu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments indicate that our multi-reward conditioned model outperforms its no-reward counterpart on two popular editing pipelines, i.e., InsPix2Pix and SmartEdit. Extensive experiments show that the proposed method can be combined with existing editing models for a significant performance boost on all three perspectives, achieving state-of-the-art performance in both GPT-4o and human evaluation.
Researcher Affiliation | Collaboration | Xin Gu1,2, Ming Li1, Libo Zhang2,3, Fan Chen1, Longyin Wen1, Tiejian Luo2, Sijie Zhu1; 1ByteDance Inc., 2University of Chinese Academy of Sciences, 3Institute of Software, Chinese Academy of Sciences
Pseudocode | No | The paper describes methods using mathematical equations and structured steps in paragraph text within Section 4 ('METHODOLOGY'), but does not contain a clearly labeled pseudocode or algorithm block.
Open Source Code | Yes | Code is released at https://github.com/bytedance/Multi-Reward-Editing.
Open Datasets | Yes | The most widely used InsPix2Pix (Brooks et al., 2023) dataset is created with a pretrained text-to-image Stable Diffusion (SD) model (Rombach et al., 2022)... We first carefully selected 80 high-quality images from the Unsplash website as the original images... Code is released at https://github.com/bytedance/Multi-Reward-Editing. (The Unsplash dataset link is provided implicitly within the references: https://github.com/unsplash/datasets)
Dataset Splits | Yes | First, we randomly selected 20K training triplets from the InsPix2Pix dataset, where each triplet contains an original image, an edited image, and an editing instruction. To evaluate the editing models on real-world photos and diverse instructions covering 7 major categories (defined in Sec. 5), we create an evaluation set with 80 high-quality Unsplash (uns) photos and 560 challenging instructions, which are initially generated by GPT-4o and verified by human annotators.
Hardware Specification | No | No specific hardware details such as GPU model, CPU, or memory were provided for running the experiments.
Software Dependencies | No | Our method is implemented in Python using PyTorch. This statement does not include specific version numbers for Python, PyTorch, or any other libraries used.
Experiment Setup | Yes | During training, we only optimize the MRC module, the U-Net module, the reward encoder, and the connected linear layers. We use the Adam (Kingma, 2014) optimizer with an initial learning rate of 5e-5, a weight decay of 1e-2, and a warm-up ratio of 0. We resize the images to 256 and apply random cropping during training, and resize the shorter side to 512 during inference.