Extract Free Dense Misalignment from CLIP
Authors: JeongYeon Nam, Jinbae Im, Wonjae Kim, Taeho Kil
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method on various dense misalignment detection benchmarks, covering various image and text domains and misalignment types. Our method demonstrates state-of-the-art performance among zero-shot models and competitive performance with fine-tuned models while maintaining superior efficiency. Our qualitative examples show that our method has a unique strength in detecting entity-level objects, intangible objects, and attributes that cannot be easily detected by existing works. We conduct ablation studies and analyses to highlight the strengths and limitations of our approach. |
| Researcher Affiliation | Industry | JeongYeon Nam¹*, Jinbae Im¹, Wonjae Kim², Taeho Kil¹ — ¹NAVER Cloud AI, ²NAVER AI Lab. *Corresponding author: EMAIL |
| Pseudocode | No | The paper describes the method using text and mathematical equations in the 'Method' section, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code https://github.com/naver-ai/CLIP4DM |
| Open Datasets | Yes | We thoroughly evaluate our method on diverse dense misalignment detection benchmarks (FOIL (Shekhar et al. 2017), nocaps-FOIL (Petryk et al. 2024), HAT (Petryk et al. 2024), SeeTRUE-Feedback (Gordon et al. 2024), and Rich-HF (Liang et al. 2024)), encompassing various text, image, and misalignment types. |
| Dataset Splits | Yes | We set our hyperparameters by searching the development set of Rich-HF and a subset of the training set from the FOIL dataset. We report two variants of CLIP: OpenAI CLIP ViT-B/32 (Radford et al. 2021), following Hessel et al. (2021), and ViT-H/14 trained on LAION-2B (Schuhmann et al. 2022) from OpenCLIP (Cherti et al. 2023), which yields our best score. For detailed information about the datasets and experiments on additional benchmarks, please refer to the supplementary materials. FOIL and nocaps-FOIL. FOIL (Shekhar et al. 2017) and nocaps-FOIL (Petryk et al. 2024) are benchmarks for detecting misaligned captions... In nocaps-FOIL, we report results as in-domain, near-domain, or out-of-domain based on how similar the altered objects are to COCO object classes. HAT. The HAT dataset (Petryk et al. 2024) comprises 400 human-annotated samples... SeeTRUE-Feedback. SeeTRUE-Feedback (Gordon et al. 2024) comprises a test set of 2K samples... Rich-HF. Rich-HF (Liang et al. 2024) comprises 955 prompt and image pairs with word-level misalignment annotations and an overall alignment score. |
| Hardware Specification | Yes | Frames-Per-Second (FPS) is measured with a single V100. |
| Software Dependencies | No | The paper refers to models and datasets like OpenAI CLIP ViT-B/32 and the BART NLI model with citations, but it does not specify software library version numbers (e.g., PyTorch version, CUDA version, or specific versions of any other key dependencies). |
| Experiment Setup | Yes | We set our hyperparameters by searching the development set of Rich-HF and a subset of the training set from the FOIL dataset. We set l to 10 and 22 for ViT-B/32 and ViT-H/14, respectively, utilizing the final three layers in both cases. Unless otherwise specified, ϵ is set to -0.00005. Frames-Per-Second (FPS) is measured with a single V100. Finally, we use F-CLIPScore for the global misalignment classification task. |
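The setup row reports a per-word threshold ϵ = -0.00005, and the Rich-HF row notes word-level misalignment annotations. As an illustrative sketch only (an assumption about how such predictions might be thresholded and scored, not the authors' released CLIP4DM code or their exact metric), the two pieces could be wired together like this:

```python
# Illustrative sketch (assumption, not the paper's implementation):
# flag words whose per-word alignment score falls below a threshold eps,
# then score the flagged set against gold word-level annotations with F1.

EPS = -0.00005  # threshold value reported in the paper's experiment setup


def flag_misaligned(word_scores, eps=EPS):
    """Return the set of indices of words whose score is below eps."""
    return {i for i, s in enumerate(word_scores) if s < eps}


def word_level_f1(pred, gold):
    """F1 between predicted and gold sets of misaligned word indices."""
    pred, gold = set(pred), set(gold)
    if not pred and not gold:
        return 1.0  # nothing to flag and nothing flagged: perfect agreement
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical per-word scores for a four-word caption.
scores = [0.02, -0.003, 0.01, -0.0001]
predicted = flag_misaligned(scores)          # indices 1 and 3 fall below EPS
score = word_level_f1(predicted, {1, 2, 3})  # compare against gold annotations
```

The actual benchmarks (FOIL, Rich-HF, etc.) each define their own evaluation protocol, so this scorer stands in only for the general shape of word-level evaluation.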