Mutual Information Divergence: A Unified Metric for Multimodal Generative Models
Authors: Jin-Hwa Kim, Yunji Kim, Jiyoung Lee, Kang Min Yoo, Sang-Woo Lee
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To validate, we extensively compare it with competing metrics using carefully-generated or human-annotated judgments in text-to-image generation and image captioning tasks. The proposed MID significantly outperforms the competitive methods by having consistency across benchmarks, sample parsimony, and robustness toward the exploited CLIP model. |
| Researcher Affiliation | Collaboration | Jin-Hwa Kim: NAVER AI Lab, SNU AIIS, Republic of Korea; Yunji Kim, Jiyoung Lee: NAVER AI Lab, Republic of Korea; Kang Min Yoo: NAVER AI Lab, CLOVA, SNU AIIS, Republic of Korea; Sang-Woo Lee: NAVER CLOVA, AI Lab, KAIST AI, Republic of Korea |
| Pseudocode | No | No pseudocode or algorithm blocks are present in the paper. |
| Open Source Code | Yes | The code is available at https://github.com/naver-ai/mid.metric. |
| Open Datasets | Yes | COCO dataset [2], CUB [28, 29] and Flowers [30], Flickr8K-Expert [48], Flickr8K-CF [48], Pascal-50S [15], FOIL-COCO [40] |
| Dataset Splits | No | The paper evaluates existing models on established benchmarks and human-judgment datasets, but it does not explicitly detail the training/validation splits used when computing the proposed metric, relying instead on standard benchmark practice for test sets and reference data. |
| Hardware Specification | No | The NAVER Smart Machine Learning (NSML) platform [50] has been used in the experiments. |
| Software Dependencies | Yes | We use the CLIP (ViT-L/14) to extract image and text embedding vectors. |
| Experiment Setup | Yes | Without an explicit mention, we use the CLIP (ViT-L/14) to extract image and text embedding vectors. Note that it is crucial to use double-precision for numerical stability. We found that λ of 5e-4 generally works across all benchmark evaluations, except for the FOIL benchmark where we used λ of 1e-15, which was slightly better. Note that we use an identical prompt 'A photo depicts' for all caption embeddings as employed in RefCLIP-S [19]. |
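
For concreteness, the sketch below illustrates a setup consistent with the details reported above: CLIP ViT-L/14 embeddings, the fixed prompt "A photo depicts" prepended to every caption (as in RefCLIP-S), double-precision features, and a λ-regularized covariance. This is a minimal, assumption-laden illustration, not the authors' implementation (see https://github.com/naver-ai/mid.metric for the official code); the helper names `embed_pairs` and `regularized_covariance`, and the exact way λ enters the covariance estimate, are hypothetical.

```python
import torch
import clip  # OpenAI CLIP package: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# CLIP ViT-L/14 is the backbone reported in the paper's setup.
model, preprocess = clip.load("ViT-L/14", device=device)


def embed_pairs(image_paths, captions, prompt="A photo depicts "):
    """Extract paired CLIP image/text embeddings.

    The same prompt is prepended to all captions, following the setup
    described above. Features are cast to double precision, which the
    paper notes is crucial for numerical stability.
    """
    images = torch.stack([preprocess(Image.open(p)) for p in image_paths]).to(device)
    texts = clip.tokenize([prompt + c for c in captions], truncate=True).to(device)
    with torch.no_grad():
        x = model.encode_image(images)
        y = model.encode_text(texts)
    return x.double(), y.double()


def regularized_covariance(feats, lam=5e-4):
    # Hypothetical helper: shrink the empirical covariance toward the
    # identity using the regularizer lambda reported in the table
    # (5e-4 for most benchmarks, 1e-15 for FOIL). How lambda is applied
    # in the official code may differ; this is only an assumption.
    feats = feats - feats.mean(dim=0, keepdim=True)
    cov = feats.T @ feats / (feats.shape[0] - 1)
    return cov + lam * torch.eye(cov.shape[0], dtype=feats.dtype, device=feats.device)
```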