Mutual Information Divergence: A Unified Metric for Multimodal Generative Models
Authors: Jin-Hwa Kim, Yunji Kim, Jiyoung Lee, Kang Min Yoo, Sang-Woo Lee
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To validate, we extensively compare it with competing metrics using carefully-generated or human-annotated judgments in text-to-image generation and image captioning tasks. The proposed MID significantly outperforms the competitive methods by having consistency across benchmarks, sample parsimony, and robustness toward the exploited CLIP model. |
| Researcher Affiliation | Collaboration | Jin-Hwa Kim: NAVER AI Lab, SNU AIIS, Republic of Korea; Yunji Kim, Jiyoung Lee: NAVER AI Lab, Republic of Korea; Kang Min Yoo: NAVER AI Lab, CLOVA, SNU AIIS, Republic of Korea; Sang-Woo Lee: NAVER CLOVA, AI Lab, KAIST AI, Republic of Korea |
| Pseudocode | No | No pseudocode or algorithm blocks are present in the paper. |
| Open Source Code | Yes | The code is available at https://github.com/naver-ai/mid.metric. |
| Open Datasets | Yes | COCO dataset [2], CUB [28, 29] and Flowers [30], Flickr8K-Expert [48], Flickr8K-CF [48], Pascal-50S [15], FOIL-COCO [40] |
| Dataset Splits | No | The paper evaluates existing models on established benchmarks and human-judgment datasets, but it does not explicitly detail the training/validation splits used when computing the proposed metric, relying instead on standard benchmark practice for test sets and reference data. |
| Hardware Specification | No | The NAVER Smart Machine Learning (NSML) platform [50] has been used in the experiments. |
| Software Dependencies | Yes | We use the CLIP (ViT-L/14) to extract image and text embedding vectors. |
| Experiment Setup | Yes | Without an explicit mention, we use the CLIP (ViT-L/14) to extract image and text embedding vectors. Note that it is crucial to use double-precision for numerical stability. We found that λ of 5e-4 generally works across all benchmark evaluations, except for the FOIL benchmark where we used λ of 1e-15, which was slightly better. Note that we use an identical prompt 'A photo depicts' for all caption embeddings as employed in RefCLIP-S [19]. |
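
For concreteness, the sketch below illustrates a setup consistent with the details reported above: CLIP ViT-L/14 embeddings, the fixed prompt "A photo depicts" prepended to every caption (as in RefCLIP-S), double-precision features, and a λ-regularized covariance. This is a minimal, assumption-laden illustration, not the authors' implementation (see https://github.com/naver-ai/mid.metric for the official code); the helper names `embed_pairs` and `regularized_covariance`, and the exact way λ enters the covariance estimate, are hypothetical.

```python
import torch
import clip  # OpenAI CLIP package: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# CLIP ViT-L/14 is the backbone reported in the paper's setup.
model, preprocess = clip.load("ViT-L/14", device=device)


def embed_pairs(image_paths, captions, prompt="A photo depicts "):
    """Extract paired CLIP image/text embeddings.

    The same prompt is prepended to all captions, following the setup
    described above. Features are cast to double precision, which the
    paper notes is crucial for numerical stability.
    """
    images = torch.stack([preprocess(Image.open(p)) for p in image_paths]).to(device)
    texts = clip.tokenize([prompt + c for c in captions], truncate=True).to(device)
    with torch.no_grad():
        x = model.encode_image(images)
        y = model.encode_text(texts)
    return x.double(), y.double()


def regularized_covariance(feats, lam=5e-4):
    # Hypothetical helper: shrink the empirical covariance toward the
    # identity using the regularizer lambda reported in the table
    # (5e-4 for most benchmarks, 1e-15 for FOIL). How lambda is applied
    # in the official code may differ; this is only an assumption.
    feats = feats - feats.mean(dim=0, keepdim=True)
    cov = feats.T @ feats / (feats.shape[0] - 1)
    return cov + lam * torch.eye(cov.shape[0], dtype=feats.dtype, device=feats.device)
```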