HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing

Authors: MUDE HUI, Siwei Yang, Bingchen Zhao, Yichun Shi, Heng Wang, Peng Wang, Cihang Xie, Yuyin Zhou

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This study introduces HQ-Edit, a high-quality instruction-based image editing dataset with around 200,000 edits. ... we propose two evaluation metrics, Alignment and Coherence, to quantitatively assess the quality of image edit pairs using GPT-4V. ... HQ-Edit's high-resolution images, rich in detail and accompanied by comprehensive editing prompts, substantially enhance the capabilities of existing image editing models. For example, an HQ-Edit finetuned Instruct Pix2Pix can attain state-of-the-art image editing performance, even surpassing those models fine-tuned with human-annotated data. ... 4 EXPERIMENTS ... 4.2 QUANTITATIVE EVALUATION ON GENERATED IMAGES ... 4.3 QUALITATIVE EVALUATION ... 4.4 ABLATION STUDY
Researcher Affiliation | Collaboration | Mude Hui1, Siwei Yang1, Bingchen Zhao2, Yichun Shi3, Heng Wang3, Peng Wang3, Cihang Xie1, Yuyin Zhou1 ... 1University of California, Santa Cruz 2University of Edinburgh 3ByteDance
Pseudocode | Yes | We list all the prompts we used for data collection, including the EXPAND PROMPT used for the Expansion step; the DIPTYCH PROMPT and REWRITE PROMPT used for the Generation step; and the two metric prompts, ALIGNMENT PROMPT and COHERENCE PROMPT, used for evaluation. ... B.1 STEP #1: EXPANSION EXPAND PROMPT (GPT-4) ... B.2 STEP #2: GENERATION REWRITE PROMPT (GPT-4) DIPTYCH PROMPT (DALL-E 3) ... B.3 EVALUATION METRIC ALIGNMENT PROMPT (GPT-4V) COHERENCE PROMPT (GPT-4V)
Open Source Code | No | The paper does not contain any explicit statement about releasing source code for the methodology described in this paper, nor does it provide a link to a code repository.
Open Datasets | Yes | Additionally, we include 90 samples from the Emu Edit (Sheynin et al., 2023) test set. ... Instruct Pix2Pix (Brooks et al., 2023) is the first instruction-based image editing model, obtained by fine-tuning Stable Diffusion (Rombach et al., 2022) on a dataset of image editing examples generated by GPT-3 (Brown et al., 2020) and Prompt2Prompt.
Dataset Splits | No | The paper mentions using a testing set of 293 samples for baselines and 500 randomly sampled data points for evaluations. However, it does not specify explicit training/validation splits or their sizes for the newly introduced HQ-Edit dataset (around 200,000 edits), nor how the 293 samples relate to the overall dataset for reproducible partitioning.
Hardware Specification | Yes | During training, we set the image resolution to 512, total training steps to 15000 on 4 A100 GPUs, learning rate to 5e-5, and conditioning dropout prob to 0.05.
Software Dependencies | No | The paper mentions several models and tools used (e.g., GPT-4V, DALL-E 3, YOLOv8, DIFT, Gradio) but does not provide specific version numbers for software libraries or environments (e.g., Python, PyTorch versions) required for reproducibility.
Experiment Setup | Yes | Implementation details. We choose Instruct Pix2Pix (Brooks et al., 2023) as our default model, and use HQ-Edit to fine-tune it. During training, we set the image resolution to 512, total training steps to 15000 on 4 A100 GPUs, learning rate to 5e-5, and conditioning dropout prob to 0.05. During the editing, we set the image guidance scale to 1.5, the instruct guidance scale to 7.0, and the number of inference steps to 20.
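The GPT-4V evaluation step described in the Pseudocode row (scoring an edit pair with the ALIGNMENT PROMPT) could be sketched as below. This is only an illustration: the placeholder prompt text, the function name, and the request layout are assumptions based on the OpenAI chat-completions message format, not the authors' released code (the paper releases none).

```python
import base64

# Placeholder standing in for the paper's ALIGNMENT PROMPT (Appendix B.3);
# the real prompt text is given in the paper, not reproduced here.
ALIGNMENT_PROMPT = "Rate from 0-100 how well the edited image follows the instruction."

def build_alignment_request(instruction, src_png_bytes, dst_png_bytes):
    """Assemble a GPT-4V chat `messages` payload scoring one edit pair.

    Illustrative sketch: sending it (e.g. via client.chat.completions.create)
    requires an API key and is not shown here.
    """
    def to_data_url(png_bytes):
        # Images are passed to the vision model as base64 data URLs.
        return "data:image/png;base64," + base64.b64encode(png_bytes).decode()

    return [{
        "role": "user",
        "content": [
            {"type": "text",
             "text": f"{ALIGNMENT_PROMPT}\nEdit instruction: {instruction}"},
            {"type": "image_url",
             "image_url": {"url": to_data_url(src_png_bytes)}},   # input image
            {"type": "image_url",
             "image_url": {"url": to_data_url(dst_png_bytes)}},   # edited image
        ],
    }]
```

The COHERENCE PROMPT evaluation would follow the same shape, swapping the prompt text and judging only the edited image's internal plausibility.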
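The training and inference settings quoted in the Hardware Specification and Experiment Setup rows can be collected into a single configuration sketch. The key names below are illustrative (the paper releases no code); only the values are taken from the paper.

```python
# Fine-tuning settings reported for Instruct Pix2Pix on HQ-Edit.
# Key names are assumptions; values are as stated in the paper.
TRAIN_CONFIG = {
    "base_model": "instruct-pix2pix",      # Brooks et al., 2023
    "image_resolution": 512,
    "total_training_steps": 15_000,
    "hardware": "4x NVIDIA A100",
    "learning_rate": 5e-5,
    "conditioning_dropout_prob": 0.05,
}

# Editing-time (inference) settings reported in the paper.
INFERENCE_CONFIG = {
    "image_guidance_scale": 1.5,
    "instruct_guidance_scale": 7.0,
    "num_inference_steps": 20,
}
```

In a diffusers-style pipeline, the three inference values would typically map to the `image_guidance_scale`, `guidance_scale`, and `num_inference_steps` arguments of an InstructPix2Pix pipeline call, though the paper does not name its framework.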