RealRAG: Retrieval-augmented Realistic Image Generation via Self-reflective Contrastive Learning
Authors: Yuanhuiyi Lyu, Xu Zheng, Lutao Jiang, Yibo Yan, Xin Zou, Huiyu Zhou, Linfeng Zhang, Xuming Hu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our proposed RealRAG on three fine-grained real-world image datasets, including Stanford Cars (Krause et al., 2013), Stanford Dogs (Dataset, 2011), and Oxford Flowers (Nilsback & Zisserman, 2008). We use the test sets of these datasets to validate the realism of the generated images (Sec. 4.2). Furthermore, to validate the ability of RealRAG to generate unseen novel objects, we also test our model on recently introduced novel objects (Sec. 4.3). We employ FID, CLIP-T, and CLIP-I to compare the visual quality and realism of generated images from different methods. |
| Researcher Affiliation | Academia | 1The Hong Kong University of Science and Technology (Guangzhou) 2Guangxi Zhuang Autonomous Region Big Data Research Institute 3Shanghai Jiao Tong University 4The Hong Kong University of Science and Technology. Correspondence to: Xuming Hu <EMAIL>. |
| Pseudocode | No | The paper describes its methodology in Section 3, titled 'Methodology', using textual descriptions and mathematical formulas. However, it does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | Project page: https://qc-ly.github.io/RealRAG-page/ (Upon checking the project page, it states 'Our code will be publicly available.', indicating a future release rather than immediate availability.) |
| Open Datasets | Yes | We collect our real-object-based database from widely used real-world datasets, including ImageNet (Deng et al., 2009), Stanford Cars (Krause et al., 2013), Stanford Dogs (Dataset, 2011), and Oxford Flowers (Nilsback & Zisserman, 2008). |
| Dataset Splits | Yes | We collect our real-object-based database from widely used real-world datasets, including ImageNet (Deng et al., 2009), Stanford Cars (Krause et al., 2013), Stanford Dogs (Dataset, 2011), and Oxford Flowers (Nilsback & Zisserman, 2008). We use the training sets of these datasets to construct our database. We use the test sets of these datasets to validate the realism of the generated images (Sec. 4.2). |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, memory amounts, or detailed computer specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using the 'pre-trained CLIP model (Radford et al., 2021)' and various state-of-the-art text-to-image generative models (e.g., SD v2.1, SDXL, SD v3, Flux, OmniGen, Emu). However, it does not provide specific version numbers for these software components or other ancillary software dependencies like programming languages or libraries. |
| Experiment Setup | No | The paper mentions 'τ is a temperature hyperparameter' in the loss function (Equation 8), but its specific value is not provided. It lacks concrete details on other experimental setup aspects such as learning rate, batch size, number of epochs, optimizer settings, or explicit training configurations. |
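The evaluation row above cites CLIP-T, which measures cosine similarity between CLIP image and text embeddings of each prompt/generation pair. As a rough illustration of the metric only (not the authors' evaluation code), a minimal sketch using NumPy with random placeholder vectors standing in for real CLIP encoder outputs:

```python
import numpy as np

def clip_t_score(image_embs: np.ndarray, text_embs: np.ndarray) -> float:
    """Mean cosine similarity between paired image and text embeddings.

    In real CLIP-T evaluation, `image_embs`/`text_embs` come from the CLIP
    image and text encoders; here they are placeholder arrays.
    """
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    # Row-wise dot product of unit vectors = cosine similarity per pair.
    return float(np.mean(np.sum(img * txt, axis=1)))

# Toy example: paired embeddings with a small perturbation score near 1.
rng = np.random.default_rng(0)
imgs = rng.normal(size=(4, 512))
txts = imgs + 0.1 * rng.normal(size=(4, 512))
score = clip_t_score(imgs, txts)
```

CLIP-I follows the same recipe but compares generated-image embeddings against reference-image embeddings instead of text embeddings.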
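The Experiment Setup row notes that the temperature τ in the paper's contrastive loss (Eq. 8) is never given. For context, a generic InfoNCE-style contrastive loss with temperature is sketched below; this is not the paper's self-reflective loss, and `tau=0.07` is a common community default, not the authors' setting:

```python
import numpy as np

def info_nce_loss(anchors: np.ndarray, positives: np.ndarray,
                  tau: float = 0.07) -> float:
    """Generic InfoNCE contrastive loss with temperature `tau`.

    Row i of `anchors` is a positive pair with row i of `positives`;
    all other rows serve as in-batch negatives. `tau=0.07` is an
    assumed default, not a value reported in the paper.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / tau                        # pairwise cosine / tau
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))    # diagonal = matched pairs

# Toy check: matched pairs should incur lower loss than mismatched ones.
rng = np.random.default_rng(1)
feats = rng.normal(size=(8, 64))
aligned_loss = info_nce_loss(feats, feats)
shuffled_loss = info_nce_loss(feats, feats[::-1].copy())
```

Lower τ sharpens the softmax over negatives, which is why its unreported value matters for reproducing the training dynamics.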