Extract Free Dense Misalignment from CLIP
Authors: JeongYeon Nam, Jinbae Im, Wonjae Kim, Taeho Kil
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method on various dense misalignment detection benchmarks, covering various image and text domains and misalignment types. Our method demonstrates state-of-the-art performance among zero-shot models and competitive performance with fine-tuned models while maintaining superior efficiency. Our qualitative examples show that our method has a unique strength in detecting entity-level objects, intangible objects, and attributes that cannot be easily detected by existing works. We conduct ablation studies and analyses to highlight the strengths and limitations of our approach. |
| Researcher Affiliation | Industry | JeongYeon Nam¹*, Jinbae Im¹, Wonjae Kim², Taeho Kil¹ — ¹NAVER Cloud AI, ²NAVER AI Lab. *Corresponding author: EMAIL |
| Pseudocode | No | The paper describes the method using text and mathematical equations in the 'Method' section, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code https://github.com/naver-ai/CLIP4DM |
| Open Datasets | Yes | We thoroughly evaluate our method on diverse dense misalignment detection benchmarks (FOIL (Shekhar et al. 2017), nocaps-FOIL (Petryk et al. 2024), HAT (Petryk et al. 2024), SeeTRUE-Feedback (Gordon et al. 2024), and Rich-HF (Liang et al. 2024)), encompassing various text, image, and misalignment types. |
| Dataset Splits | Yes | We set our hyperparameters by searching the development set of Rich-HF and a subset of the training set from the FOIL dataset. We report two variants of CLIP: OpenAI CLIP ViT-B/32 (Radford et al. 2021), following Hessel et al. (2021), and ViT-H/14 trained on LAION-2B (Schuhmann et al. 2022) from OpenCLIP (Cherti et al. 2023), which yields our best score. For detailed information about the datasets and experiments on additional benchmarks, please refer to the supplementary materials. FOIL and nocaps-FOIL. FOIL (Shekhar et al. 2017) and nocaps-FOIL (Petryk et al. 2024) are benchmarks for detecting misaligned captions... In nocaps-FOIL, we report results as in-domain, near-domain, or out-of-domain based on how similar the altered objects are to COCO object classes. HAT. The HAT dataset (Petryk et al. 2024) comprises 400 human-annotated samples... SeeTRUE-Feedback. SeeTRUE-Feedback (Gordon et al. 2024) comprises a test set of 2K samples... Rich-HF. Rich-HF (Liang et al. 2024) comprises 955 prompt and image pairs with word-level misalignment annotations and an overall alignment score. |
| Hardware Specification | Yes | Frames-Per-Second (FPS) is measured with a single V100. |
| Software Dependencies | No | The paper refers to models and datasets like OpenAI CLIP ViT-B/32 and the BART NLI model with citations, but it does not specify software library version numbers (e.g., PyTorch version, CUDA version, or specific versions of any other key dependencies). |
| Experiment Setup | Yes | We set our hyperparameters by searching the development set of Rich-HF and a subset of the training set from the FOIL dataset. We set l to 10 and 22 for ViT-B/32 and ViT-H/14, respectively, utilizing the final three layers in both cases. Unless otherwise specified, ϵ is set to -0.00005. Frames-Per-Second (FPS) is measured with a single V100. Finally, we use F-CLIPScore for the global misalignment classification task. |
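The setup row reports a per-word threshold ϵ = -0.00005, and the Rich-HF row notes word-level misalignment annotations. As an illustrative sketch only (an assumption about how such predictions might be thresholded and scored, not the authors' released CLIP4DM code or their exact metric), the two pieces could be wired together like this:

```python
# Illustrative sketch (assumption, not the paper's implementation):
# flag words whose per-word alignment score falls below a threshold eps,
# then score the flagged set against gold word-level annotations with F1.

EPS = -0.00005  # threshold value reported in the paper's experiment setup


def flag_misaligned(word_scores, eps=EPS):
    """Return the set of indices of words whose score is below eps."""
    return {i for i, s in enumerate(word_scores) if s < eps}


def word_level_f1(pred, gold):
    """F1 between predicted and gold sets of misaligned word indices."""
    pred, gold = set(pred), set(gold)
    if not pred and not gold:
        return 1.0  # nothing to flag and nothing flagged: perfect agreement
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical per-word scores for a four-word caption.
scores = [0.02, -0.003, 0.01, -0.0001]
predicted = flag_misaligned(scores)          # indices 1 and 3 fall below EPS
score = word_level_f1(predicted, {1, 2, 3})  # compare against gold annotations
```

The actual benchmarks (FOIL, Rich-HF, etc.) each define their own evaluation protocol, so this scorer stands in only for the general shape of word-level evaluation.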