VSCoDe: Visual-Augmentation Selection for Contrastive Decoding

Authors: Sihyeon Kim, Boryeong Cho, Sangmin Bae, Sumyeong Ahn, Se-Young Yun

TMLR 2025

Reproducibility Variable | Result | LLM Response

Research Type | Experimental | Our empirical evaluations demonstrate that VSCoDe outperforms previous methods and enhances the quality of various vision-language tasks without additional training or reliance on external models. Our findings indicate that each augmentation has a distinct impact on the given question, altering the output distribution of VLMs and subsequently affecting the response.

Researcher Affiliation | Academia | Sihyeon Kim (EMAIL, KAIST AI); Boryeong Cho (EMAIL, KAIST AI); Sangmin Bae (EMAIL, KAIST AI); Sumyeong Ahn (EMAIL, KENTECH); Se-Young Yun (EMAIL, KAIST AI)

Pseudocode | Yes | Algorithm 1: VSCoDe: Visual-Augmented Contrastive Decoding

Open Source Code | No | The paper does not provide an explicit statement or a link to the source code for the methodology described in this paper.

Open Datasets | Yes | We conduct experiments on Visual Question Answering (VQA) tasks and captioning tasks. [...] We use MME (Fu et al., 2024), MMBench (Liu et al., 2024b), VQAv2 (Goyal et al., 2017), and POPE (Li et al., 2023c) benchmarks for the VQA task, and each dataset consists of image-question pairs. For the captioning task, we evaluate on MSCOCO (Lin et al., 2015).

Dataset Splits | No | The paper mentions selecting "30K samples from the VQAv2 evaluation dataset" and "500 random images from the validation set" for MSCOCO, but does not provide explicit training/test/validation splits (percentages, counts, or citations of standard splits) for all datasets used (MME, MMBench, POPE).

Hardware Specification | Yes | In this paper, all reports of our experiment used LVLM models that can run on a single 48 GB NVIDIA RTX A6000.

Software Dependencies | No | Following the original paper, we use spaCy (Honnibal et al., 2020) for the noun phrase detector and obtained bounding boxes by using Grounding DINO-B (Liu et al., 2023c) with a threshold of 0.3. While specific software is mentioned, version numbers for spaCy and Grounding DINO-B are not provided.

Experiment Setup | Yes | We choose α = 1.0 and β = 0.1 for the main experiment. Additionally, we use T = 1.0 and p = 1.0 for the sampling strategy, which employs the softmax distribution for the next token generation. We also conducted experiments under various decoding settings, and the corresponding ablation studies can be found in Appendix C.
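The reported hyperparameters (α = 1.0, β = 0.1, T = 1.0) match the usual shape of a visual contrastive-decoding step. Below is a minimal sketch of one such step, assuming the common formulation in which α scales the contrast between the original-image logits and the augmented-image logits, and β sets an adaptive plausibility cutoff on the original distribution; the function name and the exact combination rule are illustrative assumptions, not the paper's code:

```python
import numpy as np

def contrastive_decode_step(logits_orig, logits_aug, alpha=1.0, beta=0.1, T=1.0):
    """One decoding step of visual contrastive decoding (sketch).

    Contrasts logits from the original image against logits from an
    augmented view, then restricts sampling to plausible tokens.
    alpha/beta/T follow the values reported in the paper; the
    combination rule itself is an assumption based on standard
    contrastive-decoding formulations.
    """
    # Amplify what the original view predicts relative to the augmented view.
    contrast = (1 + alpha) * logits_orig - alpha * logits_aug

    # Adaptive plausibility constraint: keep only tokens whose original
    # probability is within a beta fraction of the most likely token.
    p_orig = np.exp(logits_orig - logits_orig.max())
    p_orig /= p_orig.sum()
    mask = p_orig >= beta * p_orig.max()
    contrast = np.where(mask, contrast, -np.inf)

    # Temperature-scaled softmax over the surviving tokens (T = 1.0 here).
    z = (contrast - contrast[mask].max()) / T
    probs = np.exp(z)
    probs /= probs.sum()
    return probs
```

With p = 1.0 (no nucleus truncation), the next token would simply be sampled from the returned distribution; implausible tokens receive probability zero via the β mask.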