AGFSync: Leveraging AI-Generated Feedback for Preference Optimization in Text-to-Image Generation

Authors: Jingkun An, Yinghao Zhu, Zongjian Li, Enshen Zhou, Haoran Feng, Xijie Huang, Bohua Chen, Yemin Shi, Chengwei Pan

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | By applying AGFSync to leading T2I models such as SD v1.4, SD v1.5, and SDXL-base, our extensive experiments on the TIFA dataset demonstrate notable improvements in VQA scores, aesthetic evaluations, and performance on the HPS v2 benchmark, consistently outperforming the base models.
Researcher Affiliation | Academia | 1 Beihang University, Beijing, China; 2 Peking University, Beijing, China; 3 Tsinghua University, Beijing, China; 4 Huazhong University of Science and Technology, Wuhan, Hubei, China; 5 Zhongguancun Laboratory, Beijing, China. EMAIL, EMAIL
Pseudocode | No | The paper describes the methodology using textual explanations and mathematical equations (Eqs. 1–9) but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Project page: https://anjingkun.github.io/AGFSync
Open Datasets | Yes | 1) TIFA (Hu et al. 2023): Based on the correct answers to a series of predefined questions. ... 2) HPS v2 (Wu et al. 2023): Human Preference Score v2 (HPS v2) is a benchmark designed to evaluate models' capabilities across a variety of image types. ... 3) MJHQ-30K: a benchmark dataset used for automatically evaluating the aesthetic quality of models (Li et al. 2024).
Dataset Splits | No | The paper uses established benchmarks like TIFA and HPS v2 but does not explicitly provide details about how these datasets were split into training, validation, or test sets for the experiments presented, nor does it refer to specific predefined splits.
Hardware Specification | No | The paper describes training parameters such as learning rate, batch size, output image size, and number of finetuning steps, but it does not specify any hardware details such as the GPU or CPU models used for the experiments.
Software Dependencies | No | The paper names several models and APIs used, such as ChatGPT (GPT-3.5), Gemini Pro, Salesforce/blip2-flan-t5-xxl, openai/clip-vit-base-patch16, Vila, and GPT-4 Vision, but it does not provide specific version numbers for underlying software components such as programming languages or deep learning frameworks.
Experiment Setup | Yes | For each given text prompt c, we let the diffusion model generate N = 8 samples as backup images for preference dataset construction. In this process, we add Gaussian noise n ~ N(0, σ²I) to the text embedding, where σ is set to 0.1. In the calculation of the CLIP score, γ is set to 100, so the CLIP score ranges between 0 and 100. We also rescale the VQA score and the aesthetic score to 0–100 by multiplying the original score by 100. The weights of the score measures are allocated as w_VQA = 0.35, w_CLIP = 0.55, w_Aesthetic = 0.1. ... For the SD v1.4 and SD v1.5 models, the learning rate is 5e-7, the batch size is 128, and the output image size is 512×512. For the SDXL-base model, the learning rate is 1e-6, the batch size is 64, and the output image size is 1024×1024. We finetune the diffusion model for 1,000 steps. The random seed is set to 200 in Fig. 3a and Fig. 3b.
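The scoring recipe quoted above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names (`perturb_text_embedding`, `composite_score`) and the sample score values are hypothetical; only the weights (w_VQA = 0.35, w_CLIP = 0.55, w_Aesthetic = 0.1), σ = 0.1, γ = 100, and N = 8 come from the paper. It assumes raw VQA and aesthetic scores lie in [0, 1] and that the CLIP similarity is a cosine similarity scaled by γ.

```python
import numpy as np

# Weights for the composite preference score, as reported in the paper.
W_VQA, W_CLIP, W_AESTHETIC = 0.35, 0.55, 0.10

def perturb_text_embedding(embedding, sigma=0.1, rng=None):
    """Add Gaussian noise n ~ N(0, sigma^2 I) to a text embedding,
    diversifying the N = 8 backup images generated per prompt."""
    rng = np.random.default_rng() if rng is None else rng
    return embedding + rng.normal(0.0, sigma, size=embedding.shape)

def composite_score(vqa, clip_similarity, aesthetic, gamma=100.0):
    """Combine the three measures into one score in [0, 100].

    vqa, aesthetic: raw scores in [0, 1], rescaled to 0-100 (x 100).
    clip_similarity: cosine similarity, scaled by gamma = 100.
    """
    return (W_VQA * 100.0 * vqa
            + W_CLIP * gamma * clip_similarity
            + W_AESTHETIC * 100.0 * aesthetic)

# Rank candidate images for one prompt (here 3 instead of N = 8, with
# made-up scores); the best and worst would form a preference pair.
candidates = [(0.90, 0.31, 0.55), (0.70, 0.28, 0.60), (0.95, 0.33, 0.50)]
scores = [composite_score(v, c, a) for v, c, a in candidates]
best, worst = int(np.argmax(scores)), int(np.argmin(scores))
```

Because the weights sum to 1 and each rescaled measure is capped at 100, the composite score stays in the 0–100 range, making per-prompt candidate rankings directly comparable.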