VSCoDe: Visual-Augmentation Selection for Contrastive Decoding

Authors: Sihyeon Kim, Boryeong Cho, Sangmin Bae, Sumyeong Ahn, Se-Young Yun

TMLR 2025

Reproducibility Variable | Result | LLM Response

Research Type | Experimental | Our empirical evaluations demonstrate that VSCoDe outperforms previous methods and enhances the quality of various vision-language tasks without additional training or reliance on external models. Our findings indicate that each augmentation has a distinct impact on the given question, altering the output distribution of VLMs and subsequently affecting the response.

Researcher Affiliation | Academia | Sihyeon Kim (EMAIL, KAIST AI); Boryeong Cho (EMAIL, KAIST AI); Sangmin Bae (EMAIL, KAIST AI); Sumyeong Ahn (EMAIL, KENTECH); Se-Young Yun (EMAIL, KAIST AI)

Pseudocode | Yes | Algorithm 1: VSCoDe: Visual-Augmented Contrastive Decoding

Open Source Code | No | The paper does not provide an explicit statement or a link to the source code for the methodology described in this paper.

Open Datasets | Yes | We conduct experiments on Visual Question Answering (VQA) tasks and captioning tasks. [...] We use MME (Fu et al., 2024), MMBench (Liu et al., 2024b), VQAv2 (Goyal et al., 2017), and POPE (Li et al., 2023c) benchmarks for the VQA task, and each dataset consists of image-question pairs. For the captioning task, we evaluate on MSCOCO (Lin et al., 2015).

Dataset Splits | No | The paper mentions selecting "30K samples from the VQAv2 evaluation dataset" and "500 random images from the validation set" for MSCOCO, but does not provide explicit training/test/validation splits (percentages, counts, or citations of standard splits) for all datasets used (MME, MMBench, POPE).

Hardware Specification | Yes | In this paper, all reports of our experiment used LVLM models that can run on a single 48 GB NVIDIA RTX A6000.

Software Dependencies | No | Following the original paper, we use spaCy (Honnibal et al., 2020) for the noun phrase detector and obtained bounding boxes by using Grounding DINO-B (Liu et al., 2023c) with a threshold of 0.3. While specific software is mentioned, version numbers for spaCy and Grounding DINO-B are not provided.

Experiment Setup | Yes | We choose α = 1.0 and β = 0.1 for the main experiment. Additionally, we use T = 1.0 and p = 1.0 for the sampling strategy, which employs the softmax distribution for the next token generation. We also conducted experiments under various decoding settings, and the corresponding ablation studies can be found in Appendix C.
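The reported hyperparameters (α = 1.0, β = 0.1, T = 1.0) match the usual shape of a visual contrastive-decoding step. Below is a minimal sketch of one such step, assuming the common formulation in which α scales the contrast between the original-image logits and the augmented-image logits, and β sets an adaptive plausibility cutoff on the original distribution; the function name and the exact combination rule are illustrative assumptions, not the paper's code:

```python
import numpy as np

def contrastive_decode_step(logits_orig, logits_aug, alpha=1.0, beta=0.1, T=1.0):
    """One decoding step of visual contrastive decoding (sketch).

    Contrasts logits from the original image against logits from an
    augmented view, then restricts sampling to plausible tokens.
    alpha/beta/T follow the values reported in the paper; the
    combination rule itself is an assumption based on standard
    contrastive-decoding formulations.
    """
    # Amplify what the original view predicts relative to the augmented view.
    contrast = (1 + alpha) * logits_orig - alpha * logits_aug

    # Adaptive plausibility constraint: keep only tokens whose original
    # probability is within a beta fraction of the most likely token.
    p_orig = np.exp(logits_orig - logits_orig.max())
    p_orig /= p_orig.sum()
    mask = p_orig >= beta * p_orig.max()
    contrast = np.where(mask, contrast, -np.inf)

    # Temperature-scaled softmax over the surviving tokens (T = 1.0 here).
    z = (contrast - contrast[mask].max()) / T
    probs = np.exp(z)
    probs /= probs.sum()
    return probs
```

With p = 1.0 (no nucleus truncation), the next token would simply be sampled from the returned distribution; implausible tokens receive probability zero via the β mask.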