Hummingbird: High Fidelity Image Generation via Multimodal Context Alignment

Authors: Minh-Quan Le, Gaurav Mittal, Tianjian Meng, A S M Iftekhar, Vishwas Suryanarayanan, Barun Patra, Dimitris Samaras, Mei Chen

ICLR 2025

Reproducibility assessment — for each variable, the result and the supporting LLM response:
Research Type: Experimental
LLM Response: "Benchmark experiments show Hummingbird outperforms all existing methods by achieving superior fidelity while maintaining diversity, validating Hummingbird's potential as a robust multimodal context-aligned image generator in complex visual tasks."
Researcher Affiliation: Collaboration
LLM Response: "Minh-Quan Le (1,2), Gaurav Mittal (1), Tianjian Meng (1), A S M Iftekhar (1), Vishwas Suryanarayanan (1), Barun Patra (1), Dimitris Samaras (2), Mei Chen (1) — 1: Microsoft, 2: Stony Brook University"
Pseudocode: Yes
LLM Response: "Algorithm 1: Multimodal Context Rewards Fine-tuning"
Open Source Code: No
LLM Response: "Project page: https://roar-ai.github.io/hummingbird"
Open Datasets: Yes
LLM Response: "For the VQA benchmark, we fine-tune Hummingbird on VQAv2 (Goyal et al., 2017) and GQA (Hudson & Manning, 2019), then evaluate using TTA on MME Perception (Fu et al., 2024), a common benchmark for assessing SOTA MLLMs. Our benchmark covers MME Perception tasks related to Existence, Count, Position, Color, and Scene (more discussion on this in Appendix A). We further leverage the Bongard Human-Object Interaction (HOI) (Jiang et al., 2022) dataset to perform Test-time Prompt Tuning (TPT) (Shu et al., 2022) to test a method's ability to maintain fidelity when focusing on sophisticated human-object interactions. Specifically, we fine-tune the UNet denoiser on the ImageNet training set (Deng et al., 2009), and perform TPT (Shu et al., 2022) using real and generated images on the ImageNet test set and four out-of-distribution (OOD) datasets: ImageNet-A (Hendrycks et al., 2021b), ImageNet-V2 (Recht et al., 2019), ImageNet-R (Hendrycks et al., 2021a), and ImageNet-Sketch (Wang et al., 2019)."
Dataset Splits: Yes
LLM Response: "For the VQA benchmark, we fine-tune Hummingbird on VQAv2 (Goyal et al., 2017) and GQA (Hudson & Manning, 2019), then evaluate using TTA on MME Perception (Fu et al., 2024)... For HOI Reasoning, we fine-tune Hummingbird on the Bongard-HOI (Jiang et al., 2022) training set and evaluate on the associated test sets... Specifically, we fine-tune the UNet denoiser on the ImageNet training set (Deng et al., 2009), and perform TPT (Shu et al., 2022) using real and generated images on the ImageNet test set and four out-of-distribution (OOD) datasets..."
Hardware Specification: Yes
LLM Response: "The fine-tuning is done on 8 NVIDIA A100 80GB GPUs using the AdamW (Loshchilov & Hutter, 2019) optimizer, a learning rate of 5e-6, and gradient accumulation steps of 8."
Software Dependencies: No
LLM Response: "We implement Hummingbird using the PyTorch (Paszke et al., 2019) and Hugging Face diffusers (Face, 2023) libraries."
Experiment Setup: Yes
LLM Response: "We perform LoRA fine-tuning with 11M trainable parameters (~0.46% of the total 2.6B parameters) on 8 NVIDIA A100 80GB GPUs using the AdamW (Loshchilov & Hutter, 2019) optimizer, a learning rate of 5e-6, and gradient accumulation steps of 8."
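The optimization setup quoted above (AdamW at a 5e-6 learning rate with 8 gradient-accumulation steps) can be sketched in plain PyTorch. This is a minimal sketch, not the paper's implementation: the tiny linear model and random data below are placeholders standing in for the LoRA-adapted 2.6B-parameter model described in the report.

```python
import torch
from torch import nn

# Minimal sketch of the reported setup: AdamW, lr 5e-6, gradient
# accumulation over 8 micro-batches before each optimizer update.
# The Linear model and random data are placeholders; the paper applies
# this to LoRA adapters (~11M trainable params) of a 2.6B-param model.
torch.manual_seed(0)
model = nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)
accum_steps = 8

w0 = model.weight.detach().clone()  # snapshot to confirm an update happened
optimizer.zero_grad()
for step in range(accum_steps):
    x = torch.randn(4, 16)
    loss = nn.functional.mse_loss(model(x), x)
    # Scale by accum_steps so the accumulated gradient equals the average
    # over the 8 micro-batches, approximating one large-batch step.
    (loss / accum_steps).backward()
optimizer.step()       # a single parameter update per accumulation window
optimizer.zero_grad()
```

In the paper's setting this loop would instead update LoRA adapter weights attached to the UNet denoiser (e.g. via an adapter library), but the accumulation pattern is the same.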