VSCoDe: Visual-Augmentation Selection for Contrastive Decoding
Authors: Sihyeon Kim, Boryeong Cho, Sangmin Bae, Sumyeong Ahn, Se-Young Yun
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical evaluations demonstrate that VSCoDe outperforms previous methods and enhances the quality of various vision-language tasks without additional training or reliance on external models. Our findings indicate that each augmentation has a distinct impact on the given question, altering the output distribution of VLMs and subsequently affecting the response. |
| Researcher Affiliation | Academia | Sihyeon Kim EMAIL KAIST AI Boryeong Cho EMAIL KAIST AI Sangmin Bae EMAIL KAIST AI Sumyeong Ahn EMAIL KENTECH Se-Young Yun EMAIL KAIST AI |
| Pseudocode | Yes | Algorithm 1 VSCoDe: Visual-Augmented Contrastive Decoding |
| Open Source Code | No | The paper does not provide an explicit statement or a link to the source code for the methodology described in this paper. |
| Open Datasets | Yes | We conduct experiments on Visual Question Answering (VQA) tasks and captioning tasks. [...] We use MME (Fu et al., 2024), MMBench (Liu et al., 2024b), VQAv2 (Goyal et al., 2017), and POPE (Li et al., 2023c) benchmarks for the VQA task, and each dataset consists of image-question pairs. For the captioning task, we evaluate on MSCOCO (Lin et al., 2015). |
| Dataset Splits | No | The paper mentions selecting '30K samples from the VQAv2 evaluation dataset' and '500 random images from the validation set' for MSCOCO, but does not provide explicit training/test/validation splits (percentages, counts, or explicit standard split citations) for all datasets used (MME, MMBench, POPE). |
| Hardware Specification | Yes | In this paper, all reports of our experiment used LVLM models that can run on a single 48 GB NVIDIA RTX A6000. |
| Software Dependencies | No | Following the original paper, we use spaCy (Honnibal et al., 2020) for the noun phrase detector and obtained bounding boxes by using Grounding DINO-B (Liu et al., 2023c) with a threshold of 0.3. While specific software is mentioned, version numbers for 'spaCy' and 'Grounding DINO-B' are not provided. |
| Experiment Setup | Yes | We choose α = 1.0 and β = 0.1 for the main experiment. Additionally, we use T = 1.0 and p = 1.0 for the sampling strategy, which employs the softmax distribution for the next token generation. We also conducted experiments under various decoding settings, and the corresponding ablation studies can be found in Appendix C. |
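The hyperparameters in the Experiment Setup row (α = 1.0, β = 0.1, T = 1.0, p = 1.0) follow the usual visual-contrastive-decoding recipe: contrast the next-token logits conditioned on the original image against those conditioned on an augmented view, apply a plausibility cutoff, then sample from a softmax. The sketch below is an assumption-laden illustration of that one decoding step, not the paper's actual implementation; the function name and the exact form of the β constraint are hypothetical.

```python
import numpy as np

def contrastive_decode_step(logits_orig, logits_aug,
                            alpha=1.0, beta=0.1, temperature=1.0):
    """One step of visual contrastive decoding (hedged sketch).

    logits_orig: next-token logits conditioned on the original image.
    logits_aug:  next-token logits conditioned on the augmented image.
    alpha: contrast strength; beta: plausibility cutoff (assumed form).
    Returns a probability distribution over the vocabulary.
    """
    logits_orig = np.asarray(logits_orig, dtype=np.float64)
    logits_aug = np.asarray(logits_aug, dtype=np.float64)

    # Amplify tokens favored under the original view and suppress
    # tokens favored under the augmented view.
    contrast = (1.0 + alpha) * logits_orig - alpha * logits_aug

    # Plausibility constraint: keep only tokens whose probability under
    # the original view is at least beta times the max such probability.
    probs_orig = np.exp(logits_orig - logits_orig.max())
    probs_orig /= probs_orig.sum()
    contrast = np.where(probs_orig >= beta * probs_orig.max(),
                        contrast, -np.inf)

    # Softmax with temperature over the contrasted logits
    # (T = 1.0 and p = 1.0 reduce to plain sampling from this softmax).
    z = contrast / temperature
    z -= z[np.isfinite(z)].max()
    probs = np.exp(z)
    probs /= probs.sum()
    return probs
```

With T = 1.0 and top-p = 1.0, as quoted from the paper, the next token is drawn directly from the returned distribution with no further truncation.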