reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Have the VLMs Lost Confidence? A Study of Sycophancy in VLMs

Authors: Shuo Li, Tao Ji, Xiaoran Fan, Linsheng Lu, Leyi Yang, Yuming Yang, Zhiheng Xi, Rui Zheng, Yuran Wang, xh.zhao, Tao Gui, Qi Zhang, Xuanjing Huang

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	This paper presents the first sycophancy evaluation benchmark for VLMs, named MM-SY, which covers ten diverse visual understanding tasks. We reveal that VLMs still sycophantically agree with users while ignoring visual facts, influenced by various factors like different tasks, user tones, model sizes, etc. To mitigate it, inspired by methods for reducing hallucination in LLMs, we investigate three methods: prompt-based, supervised fine-tuning, and direct preference optimization. We find that their ability to reduce sycophancy improves progressively.
Researcher Affiliation	Collaboration	Fudan University1 Honor Device Co., Ltd2 EMAIL, EMAIL
Pseudocode	No	The paper describes methods using mathematical formulations (e.g., Lpsftq syc log PΘ pytrue \| Csycq) and textual descriptions but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Our benchmark and code are available at https://github.com/galactic123/Sycophancy in VLMs.
Open Datasets	Yes	To facilitate the detection of sycophancy, we utilize a VQA dataset TDIUC (Wu et al., 2019)
Dataset Splits	Yes	The training and test set sizes are 3000 and 800 samples, respectively.
Hardware Specification	No	The computations in this research were performed using the CFFF platform of Fudan University. No specific details about the CPU, GPU models, or memory were provided.
Software Dependencies	No	The paper mentions using GPT-4V for data generation and refers to LLaVA-1.5, BLIP-2, and Instruct BLIP models, but does not provide specific version numbers for software libraries, frameworks (like PyTorch or TensorFlow), or Python.
Experiment Setup	Yes	Table 9: Hyperparameters setting of our SFT and DPO training. Hyperparameter SFT DPO lr 2e-5 1e-6 lr schedule cosine decay batch size 128 8 weight decay 0 epoch 1 optimizer Adam W tensor precision bf16