Hummingbird: High Fidelity Image Generation via Multimodal Context Alignment

Authors: Minh-Quan Le, Gaurav Mittal, Tianjian Meng, A S M Iftekhar, Vishwas Suryanarayanan, Barun Patra, Dimitris Samaras, Mei Chen

ICLR 2025

Reproducibility assessment — for each variable, the result and the supporting LLM response:
Research Type: Experimental
LLM Response: "Benchmark experiments show Hummingbird outperforms all existing methods by achieving superior fidelity while maintaining diversity, validating Hummingbird's potential as a robust multimodal context-aligned image generator in complex visual tasks."
Researcher Affiliation: Collaboration
LLM Response: "Minh-Quan Le (1,2), Gaurav Mittal (1), Tianjian Meng (1), A S M Iftekhar (1), Vishwas Suryanarayanan (1), Barun Patra (1), Dimitris Samaras (2), Mei Chen (1) — 1: Microsoft, 2: Stony Brook University"
Pseudocode: Yes
LLM Response: "Algorithm 1: Multimodal Context Rewards Fine-tuning"
Open Source Code: No
LLM Response: "Project page: https://roar-ai.github.io/hummingbird"
Open Datasets: Yes
LLM Response: "For the VQA benchmark, we fine-tune Hummingbird on VQAv2 (Goyal et al., 2017) and GQA (Hudson & Manning, 2019), then evaluate using TTA on MME Perception (Fu et al., 2024), a common benchmark for assessing SOTA MLLMs. Our benchmark covers MME Perception tasks related to Existence, Count, Position, Color, and Scene (more discussion on this in Appendix A). We further leverage the Bongard Human-Object Interaction (HOI) (Jiang et al., 2022) dataset to perform Test-time Prompt Tuning (TPT) (Shu et al., 2022) to test a method's ability to maintain fidelity when focusing on sophisticated human-object interactions. Specifically, we fine-tune the UNet denoiser on the ImageNet training set (Deng et al., 2009), and perform TPT (Shu et al., 2022) using real and generated images on the ImageNet test set and four out-of-distribution (OOD) datasets: ImageNet-A (Hendrycks et al., 2021b), ImageNet-V2 (Recht et al., 2019), ImageNet-R (Hendrycks et al., 2021a), and ImageNet-Sketch (Wang et al., 2019)."
Dataset Splits: Yes
LLM Response: "For the VQA benchmark, we fine-tune Hummingbird on VQAv2 (Goyal et al., 2017) and GQA (Hudson & Manning, 2019), then evaluate using TTA on MME Perception (Fu et al., 2024)... For HOI Reasoning, we fine-tune Hummingbird on the Bongard-HOI (Jiang et al., 2022) training set and evaluate on the associated test sets... Specifically, we fine-tune the UNet denoiser on the ImageNet training set (Deng et al., 2009), and perform TPT (Shu et al., 2022) using real and generated images on the ImageNet test set and four out-of-distribution (OOD) datasets..."
Hardware Specification: Yes
LLM Response: "The fine-tuning is done on 8 NVIDIA A100 80GB GPUs using the AdamW (Loshchilov & Hutter, 2019) optimizer, a learning rate of 5e-6, and gradient accumulation steps of 8."
Software Dependencies: No
LLM Response: "We implement Hummingbird using the PyTorch (Paszke et al., 2019) and Hugging Face diffusers (Face, 2023) libraries."
Experiment Setup: Yes
LLM Response: "We perform LoRA fine-tuning with 11M trainable parameters (~0.46% of the total 2.6B parameters) on 8 NVIDIA A100 80GB GPUs using the AdamW (Loshchilov & Hutter, 2019) optimizer, a learning rate of 5e-6, and gradient accumulation steps of 8."
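The optimization setup quoted above (AdamW at a 5e-6 learning rate with 8 gradient-accumulation steps) can be sketched in plain PyTorch. This is a minimal sketch, not the paper's implementation: the tiny linear model and random data below are placeholders standing in for the LoRA-adapted 2.6B-parameter model described in the report.

```python
import torch
from torch import nn

# Minimal sketch of the reported setup: AdamW, lr 5e-6, gradient
# accumulation over 8 micro-batches before each optimizer update.
# The Linear model and random data are placeholders; the paper applies
# this to LoRA adapters (~11M trainable params) of a 2.6B-param model.
torch.manual_seed(0)
model = nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)
accum_steps = 8

w0 = model.weight.detach().clone()  # snapshot to confirm an update happened
optimizer.zero_grad()
for step in range(accum_steps):
    x = torch.randn(4, 16)
    loss = nn.functional.mse_loss(model(x), x)
    # Scale by accum_steps so the accumulated gradient equals the average
    # over the 8 micro-batches, approximating one large-batch step.
    (loss / accum_steps).backward()
optimizer.step()       # a single parameter update per accumulation window
optimizer.zero_grad()
```

In the paper's setting this loop would instead update LoRA adapter weights attached to the UNet denoiser (e.g. via an adapter library), but the accumulation pattern is the same.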