Natural Language Inference Improves Compositionality in Vision-Language Models

Authors: Paola Cascante-Bonilla, Yu (Hope) Hou, Yang Cao, Hal Daumé III, Rachel Rudinger

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Through extensive experiments, we show that CECE enhances interpretability and reduces overreliance on biased or superficial features. By balancing CECE along the original premise, we achieve significant improvements over previous methods without requiring additional fine-tuning, producing state-of-the-art results on benchmarks that score agreement with human judgments for image-text alignment, and achieving an increase in performance on Winoground of +19.2% (group score) and +12.9% on EqBen (group score) over the best prior work (finetuned with targeted data)."
Researcher Affiliation | Academia | Paola Cascante-Bonilla (1,2), Yu Hou (1), Yang Trista Cao (3), Hal Daumé III (1), Rachel Rudinger (1); 1: University of Maryland, College Park; 2: Stony Brook University; 3: University of Texas at Austin
Pseudocode | No | The paper describes the methodology in Section 3 and provides a "Prompt template" in Figure 4, which is a structured text example rather than formal pseudocode or an algorithm block.
Open Source Code | Yes | Project page: https://cece-vlm.github.io/
Open Datasets | Yes | "We report results on two benchmarks (Winoground (Thrush et al., 2022), EqBen (Wang et al., 2023b))... We report results on five text-to-image evaluation benchmarks (DrawBench (Saharia et al., 2022), EditBench (Wang et al., 2023a), COCO-T2I (Lin et al., 2014), TIFA160 (Hu et al., 2023), Pick-a-Pic (Kirstain et al., 2023))... We report results on the Stanford T23D (Wu et al., 2024) benchmark with the human ratings collected by Lin et al. (2024)."
Dataset Splits | Yes | "We conduct experiments on Winoground and EqBen. Results are shown in Table 1... We show results on five text-to-image evaluation benchmarks in Table 2... We report results on the Stanford T23D (Wu et al., 2024) benchmark with the human ratings collected by Lin et al. (2024)." These are all well-established benchmarks with defined evaluation sets.
Hardware Specification | No | The paper does not explicitly provide specific hardware details (e.g., GPU/CPU models, memory specifications) used for running its experiments.
Software Dependencies | No | The paper mentions using "Llama3.1 70B" as an LLM and various VLMs such as "BLIPv2", "InstructBLIP", "LLaVA-1.5", and "LLaVA-1.6". However, it does not specify versions for general software dependencies such as programming languages, libraries (e.g., PyTorch, TensorFlow), or operating systems, which are typically required for reproducibility.
Experiment Setup | Yes | "To integrate the information from entailments, contradictions, and the original caption, we employ a two-step balancing process using hyperparameters α1 and α2... We use α1 = 0.5 and α2 = 0.6 in all experiments."
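The quoted setup only states that α1 and α2 balance entailment scores, contradiction scores, and the original caption score in two interpolation steps. A minimal sketch of one plausible reading follows; the function name, the input structure, and the exact combination formula are assumptions for illustration, not the paper's verified implementation.

```python
def balance_scores(score_caption, scores_entail, scores_contra,
                   alpha1=0.5, alpha2=0.6):
    """Two-step balancing sketch (hypothetical formula).

    Step 1: blend mean entailment support with mean contradiction
            evidence (negated, since contradictions count against
            the caption) using alpha1.
    Step 2: blend that aggregate with the original caption score
            using alpha2. Defaults follow the paper's reported
            alpha1 = 0.5 and alpha2 = 0.6.
    """
    mean_entail = sum(scores_entail) / len(scores_entail)
    mean_contra = sum(scores_contra) / len(scores_contra)
    # Step 1: aggregate the NLI-derived evidence.
    cece = alpha1 * mean_entail + (1 - alpha1) * (1 - mean_contra)
    # Step 2: balance against the original premise (caption) score.
    return alpha2 * score_caption + (1 - alpha2) * cece

# Example: a caption scored 0.8, with two entailment and two
# contradiction scores from the VLM.
final = balance_scores(0.8, [0.7, 0.9], [0.2, 0.1])  # -> 0.81
```

With the default hyperparameters, the final score stays in [0, 1] whenever the inputs do, so it can be compared directly against unbalanced caption scores.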