TIGeR: Unifying Text-to-Image Generation and Retrieval with Large Multimodal Models

Authors: Leigang Qu, Haochuan Li, Tan Wang, Wenjie Wang, Yongqi Li, Liqiang Nie, Tat-Seng Chua

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on TIGeR-Bench and two retrieval benchmarks, i.e., Flickr30K and MS-COCO, demonstrate the superiority of our proposed framework. The code, models, and benchmark are available at https://tiger-t2i.github.io.
Researcher Affiliation | Academia | 1 National University of Singapore, 2 Nanyang Technological University, 3 University of Science and Technology of China, 4 Hong Kong Polytechnic University, 5 Harbin Institute of Technology (Shenzhen)
Pseudocode | No | The paper describes methods in prose and does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The code, models, and benchmark are available at https://tiger-t2i.github.io.
Open Datasets | Yes | To standardize the evaluation of unified text-to-image generation and retrieval, we construct TIGeR-Bench, a benchmark spanning both creative and knowledge-intensive domains. Extensive experiments on TIGeR-Bench and two retrieval benchmarks, i.e., Flickr30K and MS-COCO, demonstrate the superiority of our proposed framework. The code, models, and benchmark are available at https://tiger-t2i.github.io.
Dataset Splits | Yes | To evaluate text-to-image generation and retrieval, we prioritize selecting the original test split of each dataset to construct TIGeR-Bench. In cases where only a validation set is provided, we default to utilizing the validation set. ... We keep the ratio of 1:1 for creative and knowledge domains and collect 6,000 high-quality text-image pairs in total.
Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments, such as GPU models, CPU types, or memory specifications.
Software Dependencies | No | The paper mentions software such as SEED-LLaMA, LaVIT, SDXL, CLIP, and LLMs but does not provide specific version numbers for these or other key software components.
Experiment Setup | Yes | We utilize the 8B version of SEED-LLaMA and load the parameters of supervised fine-tuning. For LaVIT, we employ the 11B model with SDXL as the pixel decoder. ... The beam size for retrieval is set to 800, and the timestep for generation is 25.
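The reported setup can be summarized as a small configuration sketch. This is illustrative only: the field names below are hypothetical, and only the values (8B SEED-LLaMA, 11B LaVIT with SDXL, beam size 800, 25 timesteps) come from the quoted passage.

```python
# Hypothetical config mirroring the experiment setup quoted above.
# Field names are illustrative; values are taken from the paper's text.
from dataclasses import dataclass

@dataclass
class TigerEvalConfig:
    seed_llama_size: str = "8B"      # SEED-LLaMA variant, supervised fine-tuned weights
    lavit_size: str = "11B"          # LaVIT variant
    pixel_decoder: str = "SDXL"      # pixel decoder paired with LaVIT
    retrieval_beam_size: int = 800   # beam size used for retrieval
    generation_timesteps: int = 25   # diffusion timesteps used for generation

cfg = TigerEvalConfig()
print(cfg.retrieval_beam_size)  # 800
```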