FRAP: Faithful and Realistic Text-to-Image Generation with Adaptive Prompt Weighting

Authors: Liyao Jiang, Negar Hassanpour, Mohammad Salameh, Mohan Sai Singamsetti, Fengyu Sun, Wei Lu, Di Niu

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Through extensive evaluations, we show FRAP generates images with higher or comparable prompt-image alignment to prompts from complex datasets, while having a lower average latency compared to recent latent code optimization methods... We extensively evaluate the faithfulness, overall image quality, and image authenticity of FRAP-generated images via prompt-image alignment metrics, image quality assessment metrics, and an image authenticity metric."
Researcher Affiliation | Collaboration | 1 Department of Electrical and Computer Engineering, University of Alberta; 2 Huawei Technologies Canada; 3 Huawei Kirin Solution, China
Pseudocode | Yes | "In Algorithm 1, we provide the detailed algorithm of our proposed FRAP method."
Open Source Code | Yes | "We release the code at the following link: https://github.com/LiyaoJiang1998/FRAP/."
Open Datasets | Yes | "We evaluate on three Simple, manually crafted prompt datasets from A&E: Animal-Animal (S-AA), Color-Object (S-CO), and Animal-Object (S-AO); and five Complex datasets from D&B: Animal-Scene (C-AS), Color-Object-Scene (C-COS), Multi-Object (C-MO), COCO-Attribute (C-CA), and COCO-Subject (C-CS)... We adopt the validation set of the MS-COCO dataset (Lin et al., 2014)... We refer to this dataset as COCO-5K and will release this dataset to facilitate reproducibility and further research."
Dataset Splits | No | The paper uses various prompt datasets (Animal-Animal, Multi-Object, Animal-Object, Color-Object, Animal-Scene, COCO-Attribute, COCO-Subject, Color-Obj-Scene, COCO-5K, DrawBench, ABC-6K) for evaluation, but does not specify training/validation/test splits for any model trained or fine-tuned within the scope of this work. For COCO-5K, it describes only how a subset was sampled: "This filtering process selects 16k most relevant prompts from the original 40k prompts in MS-COCO, and we randomly sample a 5k subset from the 16k most relevant prompts."
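The quoted COCO-5K construction (filter to the 16k most relevant prompts, then randomly draw 5k) can be sketched as below. The relevance scores and the `sample_coco_5k` helper name are assumptions for illustration; the quoted text does not specify the actual filtering criterion.

```python
import random

def sample_coco_5k(prompts, relevance, k_relevant=16000, k_final=5000, seed=0):
    """Sketch of the COCO-5K construction: keep the k_relevant most
    relevant prompts, then draw a random k_final subset from them.

    `relevance` maps each prompt to an assumed relevance score; the
    paper's actual filtering criterion is not given in the quoted text.
    """
    ranked = sorted(prompts, key=lambda p: relevance[p], reverse=True)
    top = ranked[:k_relevant]  # the "most relevant" pool
    return random.Random(seed).sample(top, k_final)  # seeded for repeatability
```

Seeding the sampler is one simple way to make such a subset reproducible, which is the concern this review row is checking.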
Hardware Specification | Yes | "Our reported latency measures the average wall-clock time for generating one image on each dataset in seconds with a V100 GPU."
Software Dependencies | No | The paper mentions using 'Stable Diffusion 1.5' as the base model, 'FP16 precision', the 'PNDM scheduler', and the 'spaCy language parser', but it does not provide version numbers for these components or for other key libraries such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | "Following Chefer et al. (2023); Li et al. (2023), we use the 16x16 CA units for computing the objective function. The weight of the object-modifier binding loss is λ = 1. For the optimization in Eq. (9), we use a constant step size η_t = η = 1. We apply our adaptive prompt weighting method to a subset of time steps t = T, T-1, ..., t_end, where T = 50 and t_end = 26. For selecting the initial latent code, we perform 15 steps of inference from t = T = 50 to t_select = 36 with a batch of |B| = 4 noisy latent codes sampled from N(0, I). We use a CFG guidance scale of β = 7.5, and the Gaussian filter kernel size is 3 with a standard deviation of 0.5."
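For quick reference, the hyperparameters reported in this setup can be collected into a single configuration sketch. The dictionary keys and the helper functions are our own naming for illustration, not identifiers from the paper or its released code; only the values come from the quoted text.

```python
# Hedged sketch: names are ours; values are the reported hyperparameters.
FRAP_CONFIG = {
    "ca_resolution": (16, 16),  # cross-attention (CA) units used for the objective
    "lambda_binding": 1.0,      # weight λ of the object-modifier binding loss
    "eta": 1.0,                 # constant step size η in Eq. (9)
    "T": 50,                    # total denoising time steps
    "t_end": 26,                # adaptive weighting applied for t = T, ..., t_end
    "t_select": 36,             # initial-latent selection runs t = 50, ..., 36
    "batch_latents": 4,         # |B| candidate latent codes drawn from N(0, I)
    "cfg_scale": 7.5,           # classifier-free guidance scale β
    "gauss_kernel_size": 3,     # Gaussian filter kernel size
    "gauss_sigma": 0.5,         # Gaussian filter standard deviation
}

def weighting_steps(cfg):
    """Time steps (descending) where adaptive prompt weighting is applied."""
    return list(range(cfg["T"], cfg["t_end"] - 1, -1))

def selection_steps(cfg):
    """Time steps used for initial latent code selection."""
    return list(range(cfg["T"], cfg["t_select"] - 1, -1))
```

Note the internal consistency check this enables: running from t = 50 down to t_select = 36 inclusive is exactly the 15 inference steps the quote reports.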