TIGeR: Unifying Text-to-Image Generation and Retrieval with Large Multimodal Models
Authors: Leigang Qu, Haochuan Li, Tan Wang, Wenjie Wang, Yongqi Li, Liqiang Nie, Tat-Seng Chua
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on TIGeR-Bench and two retrieval benchmarks, i.e., Flickr30K and MS-COCO, demonstrate the superiority of our proposed framework. The code, models, and benchmark are available at https://tiger-t2i.github.io. |
| Researcher Affiliation | Academia | 1 National University of Singapore, 2 Nanyang Technological University, 3 University of Science and Technology of China, 4 Hong Kong Polytechnic University, 5 Harbin Institute of Technology (Shenzhen) |
| Pseudocode | No | The paper describes methods in prose and does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code, models, and benchmark are available at https://tiger-t2i.github.io. |
| Open Datasets | Yes | To standardize the evaluation of unified text-to-image generation and retrieval, we construct TIGeR-Bench, a benchmark spanning both creative and knowledge-intensive domains. Extensive experiments on TIGeR-Bench and two retrieval benchmarks, i.e., Flickr30K and MS-COCO, demonstrate the superiority of our proposed framework. The code, models, and benchmark are available at https://tiger-t2i.github.io. |
| Dataset Splits | Yes | To evaluate text-to-image generation and retrieval, we prioritize selecting the original test split of each dataset to construct TIGeR-Bench. In cases where only a validation set is provided, we default to utilizing the validation set. ... We keep the ratio of 1 : 1 for creative and knowledge domains and collect 6,000 high-quality text-image pairs in total. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper mentions software like SEED-LLaMA, LaVIT, SDXL, CLIP, and LLMs but does not provide specific version numbers for any of these or other key software components. |
| Experiment Setup | Yes | We utilize the 8B version of SEED-LLaMA and load the parameters of supervised fine-tuning. For LaVIT, we employ the 11B model with SDXL as the pixel decoder. ... The beam size for retrieval is set to 800, and the timestep for generation is 25. |
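The experiment-setup details reported above (model checkpoints, beam size, generation timesteps) can be collected into a minimal configuration sketch for a reproduction attempt. This is illustrative only: the field names and the `TigerConfig` class are hypothetical and do not come from the TIGeR codebase; the pixel decoder for the SEED-LLaMA setup is left unspecified because the quoted text does not state it.

```python
# Hypothetical reproduction config; field names are illustrative,
# values are the ones reported in the paper's experiment setup.
from dataclasses import dataclass
from typing import Optional


@dataclass
class TigerConfig:
    mllm: str                        # base large multimodal model
    mllm_params: str                 # parameter count of the chosen checkpoint
    pixel_decoder: Optional[str]     # decoder rendering image tokens to pixels (None if unstated)
    beam_size: int                   # beam size used for retrieval
    timesteps: int                   # diffusion timesteps used for generation


# SEED-LLaMA setup: 8B checkpoint with supervised fine-tuning parameters loaded.
seed_llama_cfg = TigerConfig(
    mllm="SEED-LLaMA", mllm_params="8B",
    pixel_decoder=None,  # not specified in the quoted excerpt
    beam_size=800, timesteps=25,
)

# LaVIT setup: 11B model with SDXL as the pixel decoder.
lavit_cfg = TigerConfig(
    mllm="LaVIT", mllm_params="11B",
    pixel_decoder="SDXL",
    beam_size=800, timesteps=25,
)
```

Writing the settings down this way makes the shared hyperparameters (beam size 800, 25 timesteps) explicit across both backbone configurations.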