Hummingbird: High Fidelity Image Generation via Multimodal Context Alignment
Authors: Minh-Quan Le, Gaurav Mittal, Tianjian Meng, A S M Iftekhar, Vishwas Suryanarayanan, Barun Patra, Dimitris Samaras, Mei Chen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Benchmark experiments show Hummingbird outperforms all existing methods by achieving superior fidelity while maintaining diversity, validating Hummingbird's potential as a robust multimodal context-aligned image generator in complex visual tasks. |
| Researcher Affiliation | Collaboration | Minh-Quan Le1,2 , Gaurav Mittal1 , Tianjian Meng1, A S M Iftekhar1, Vishwas Suryanarayanan1, Barun Patra1, Dimitris Samaras2, Mei Chen1 1Microsoft, 2Stony Brook University |
| Pseudocode | Yes | Algorithm 1 Multimodal Context Rewards Fine-tuning |
| Open Source Code | No | Project page: https://roar-ai.github.io/hummingbird |
| Open Datasets | Yes | For the VQA benchmark, we fine-tune Hummingbird on VQAv2 (Goyal et al., 2017) and GQA (Hudson & Manning, 2019), then evaluate using TTA on MME Perception (Fu et al., 2024), a common benchmark for assessing SOTA MLLMs. Our benchmark covers MME Perception tasks related to Existence, Count, Position, Color, and Scene (more discussion on this in Appendix A). We further leverage the Bongard Human-Object Interaction (HOI) (Jiang et al., 2022) dataset to perform Test-time Prompt Tuning (TPT) (Shu et al., 2022) to test a method's ability to maintain fidelity when focusing on sophisticated human-object interactions. Specifically, we fine-tune the UNet denoiser on the ImageNet training set (Deng et al., 2009), and perform TPT (Shu et al., 2022) using real and generated images on the ImageNet test set and four out-of-distribution (OOD) datasets: ImageNet-A (Hendrycks et al., 2021b), ImageNet-V2 (Recht et al., 2019), ImageNet-R (Hendrycks et al., 2021a), and ImageNet-Sketch (Wang et al., 2019). |
| Dataset Splits | Yes | For the VQA benchmark, we fine-tune Hummingbird on VQAv2 (Goyal et al., 2017) and GQA (Hudson & Manning, 2019), then evaluate using TTA on MME Perception (Fu et al., 2024)... For HOI Reasoning, we fine-tune Hummingbird on the Bongard-HOI (Jiang et al., 2022) training set and evaluate on the associated test sets... Specifically, we fine-tune the UNet denoiser on the ImageNet training set (Deng et al., 2009), and perform TPT (Shu et al., 2022) using real and generated images on the ImageNet test set and four out-of-distribution (OOD) datasets... |
| Hardware Specification | Yes | The fine-tuning is done on 8 NVIDIA A100 80GB GPUs using the AdamW (Loshchilov & Hutter, 2019) optimizer, a learning rate of 5e-6, and gradient accumulation steps of 8. |
| Software Dependencies | No | We implement Hummingbird using PyTorch (Paszke et al., 2019) and Hugging Face diffusers (Hugging Face, 2023) libraries. |
| Experiment Setup | Yes | We perform LoRA fine-tuning with 11M trainable parameters (≈0.46% of the total 2.6B parameters) on 8 NVIDIA A100 80GB GPUs using the AdamW (Loshchilov & Hutter, 2019) optimizer, a learning rate of 5e-6, and gradient accumulation steps of 8. |
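The optimization recipe quoted above (AdamW, learning rate 5e-6, gradient accumulation of 8) can be illustrated with a minimal, pure-Python sketch. Everything below is illustrative: the scalar model, toy data, and loss are stand-ins, not the paper's 2.6B-parameter UNet, LoRA adapters, or diffusion objective; only the optimizer math and the accumulation pattern reflect the reported setup.

```python
import math

def adamw_step(w, g, state, lr=5e-6, betas=(0.9, 0.999), eps=1e-8, wd=0.01):
    # One AdamW update on a scalar parameter: Adam moments plus
    # decoupled weight decay (Loshchilov & Hutter, 2019).
    state["t"] += 1
    state["m"] = betas[0] * state["m"] + (1 - betas[0]) * g
    state["v"] = betas[1] * state["v"] + (1 - betas[1]) * g * g
    m_hat = state["m"] / (1 - betas[0] ** state["t"])   # bias correction
    v_hat = state["v"] / (1 - betas[1] ** state["t"])
    w -= lr * wd * w                                    # decoupled weight decay
    w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

def loss_and_grad(w, x, y):
    # Squared error on one sample; a placeholder for the real training loss.
    err = w * x - y
    return err * err, 2.0 * err * x

# Toy data: learn y = x (true weight 1.0), starting from w = 0.
data = [(float(i), float(i)) for i in range(1, 9)]
w, state = 0.0, {"t": 0, "m": 0.0, "v": 0.0}
accum_steps = 8  # matches the paper's gradient-accumulation setting

for step in range(1000):
    g_sum = 0.0
    for x, y in data[:accum_steps]:      # 8 micro-batches per optimizer step
        _, g = loss_and_grad(w, x, y)
        g_sum += g
    w = adamw_step(w, g_sum / accum_steps, state)  # one update on the mean grad
```

With lr = 5e-6 each Adam step moves the weight by roughly the learning rate, so progress is slow but monotone toward the target; in the paper this small step size is paired with 8-way accumulation across 8 GPUs to stabilize LoRA fine-tuning of the UNet.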