Defeasible Visual Entailment: Benchmark, Evaluator, and Reward-Driven Optimization

Authors: Yue Zhang, Liqiang Jing, Vibhav Gogate

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate the effectiveness of our proposed evaluator and optimization method. ... We introduce a new benchmark. To create this benchmark, we developed a new dataset for the DVE task... We conducted a human evaluation to compare the performance of our evaluator with existing metrics... Our experimental results demonstrate that our metric achieves the best correlation with human evaluation results... Our experimental results demonstrate that this new method produces higher-quality updates compared to baseline approaches.
Researcher Affiliation | Academia | The University of Texas at Dallas EMAIL, EMAIL
Pseudocode | No | The paper describes methods and algorithms in paragraph text and block descriptions, but does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block, nor structured steps formatted like code.
Open Source Code | Yes | Our code and data are available at https://github.com/skywalkerzhang/Defeasible Visual Entailment.
Open Datasets | Yes | To create this benchmark, we developed a new dataset for the DVE task by replacing the premises in the δ-NLI dataset (Rudinger et al. 2020) with images from the Flickr30k dataset (Young et al. 2014). This approach minimizes costs while maximizing the use of existing resources. ... Specifically, for each premise-hypothesis pair (T, H) in the SNLI dataset, we replace the text premise with its corresponding image in Flickr30k... We constructed our new dataset based on the Flickr30k, SNLI, and δ-NLI datasets.
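The quoted construction swaps each δ-NLI text premise for its source Flickr30k image (SNLI premises are Flickr30k captions, so each premise can be traced back to an image). A minimal sketch of that replacement step, assuming hypothetical field names (`premise`, `hypothesis`, `update`, `image_id`) and a precomputed caption-to-image mapping:

```python
# Illustrative sketch of the premise-replacement step; field names and
# the caption_to_image mapping are assumptions, not the authors' code.

def build_dve_examples(delta_nli_rows, caption_to_image):
    """Replace each text premise with its Flickr30k image ID."""
    dve = []
    for row in delta_nli_rows:
        image_id = caption_to_image.get(row["premise"])
        if image_id is None:          # premise not traceable to an image
            continue
        dve.append({
            "image_id": image_id,     # visual premise
            "hypothesis": row["hypothesis"],
            "update": row["update"],  # strengthener/weakener sentence
            "label": row["label"],    # "strengthener" or "weakener"
        })
    return dve
```

Rows whose premise cannot be mapped back to an image are simply dropped, which matches the paper's goal of reusing existing resources rather than collecting new images.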
Dataset Splits | Yes | In this section, we present a statistical overview of the DVE dataset, divided into training, development, and test sets. The statistics are summarized in Table 1. Overall, the DVE dataset's balanced and diverse data support comprehensive training and evaluation of models on visual defeasible inference tasks. ... Table 1: Statistics of the DVE dataset (train / validation / test) — Total samples: 93,082 / 1,888 / 1,972; Update type dist.: Weakener 46,541 / 944 / 986, Strengthener 46,541 / 944 / 986.
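The Table 1 figures can be sanity-checked directly: in every split the weakener and strengthener counts are equal and sum to the split total, confirming the 50/50 balance the paper claims.

```python
# Consistency check of the Table 1 statistics quoted above.
splits = {
    "train":      {"total": 93_082, "weakener": 46_541, "strengthener": 46_541},
    "validation": {"total": 1_888,  "weakener": 944,    "strengthener": 944},
    "test":       {"total": 1_972,  "weakener": 986,    "strengthener": 986},
}

for name, s in splits.items():
    assert s["weakener"] + s["strengthener"] == s["total"]
    assert s["weakener"] == s["strengthener"]  # exact 50/50 balance
```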
Hardware Specification | No | The paper mentions using specific models like ResNet-50 and BERT, and large vision-language models (LVLMs) such as GPT-4o, but does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used for training or running experiments.
Software Dependencies | No | The paper mentions several models and frameworks like ResNet-50, BERT, OFA, UNITER, FGAIF, CoCa, ViLT, FLAVA, CLIP, InstructBLIP, LLaVA-1.5, mPLUG-Owl, Multimodal-GPT, MiniGPT-4, and GPT-4o. However, it does not specify any version numbers for these software components or underlying libraries (e.g., PyTorch, TensorFlow, Python).
Experiment Setup | No | Experimental Setup: For the Classification Task, we selected seven models, categorized into two types: finetuning-based methods and models evaluated in the zero-shot setting. ... We fine-tuned these models on our training set with a standard cross-entropy classification loss function. ... For the Generation Task, we selected six widely used LVLMs in a zero-shot setting as baselines... More details of the experiments can be found in the supplementary material.
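The quoted setup fine-tunes the classification models with a standard cross-entropy loss over the two update classes. A minimal NumPy sketch of that loss, with binary labels (0 = weakener, 1 = strengthener as an illustrative convention, not taken from the paper):

```python
import numpy as np

def cross_entropy_loss(logits, labels):
    """Mean softmax cross-entropy; logits: (N, 2), labels: (N,) in {0, 1}."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# Example: two (image, hypothesis, update) triples scored by a classifier.
logits = np.array([[2.0, -1.0],   # confidently predicts weakener (class 0)
                   [0.5,  1.5]])  # leans strengthener (class 1)
labels = np.array([0, 1])
loss = cross_entropy_loss(logits, labels)  # small, since both are correct
```

In practice the finetuning-based baselines would minimize this loss with an optimizer over the model's parameters; the function above only shows the objective itself.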