Fine-Grained Verifiers: Preference Modeling as Next-token Prediction in Vision-Language Alignment

Authors: Chenhang Cui, An Zhang, Yiyang Zhou, Zhaorun Chen, Gelei Deng, Huaxiu Yao, Tat-Seng Chua

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through both theoretical analysis and experimental validation, we demonstrate that FiSAO effectively addresses the misalignment problem in VLLMs, marking the first instance of token-level rewards being applied to such models. Our code is available at https://github.com/gzcch/FISAO_ICLR. From Section 4 (Experiment): In this section, we evaluate FiSAO on the modality alignment of Vision-Language Large Models (VLLMs), showcasing its effectiveness in enhancing model performance.
Researcher Affiliation | Academia | ¹National University of Singapore, ²UNC-Chapel Hill, ³University of Chicago, ⁴Nanyang Technological University
Pseudocode | Yes | We show the detailed process of FiSAO in Algorithm 1 (Algorithm 1: FiSAO).
Open Source Code | Yes | Our code is available at https://github.com/gzcch/FISAO_ICLR.
Open Datasets | Yes | We generate captions for 5,000 images randomly sampled from the COCO training dataset and utilize the widely recognized CHAIR hallucination benchmark (Rohrbach et al., 2018) to identify correctly identified and hallucinated objects. We select the first 8k data from the LLaVA-Instruct-150k dataset (Li et al., 2023b). Evaluation benchmarks: we conduct evaluations on three types of benchmarks, namely (1) comprehensive benchmarks (MME (Fu et al., 2024), SEED-Bench (Li et al., 2023a), MMBench (Liu et al., 2024c), MM-Vet (Yu et al., 2023b)); (2) VQA benchmarks (ScienceQA (SQA) (Lu et al., 2022), POPE (Li et al., 2023e), GQA (Hudson & Manning, 2019)); and (3) caption benchmarks (Li et al., 2024) (average score of BLEU, ROUGE-L, and CIDEr) and CHAIR (Rohrbach et al., 2019).
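The CHAIR benchmark quoted above scores hallucination by comparing objects mentioned in a generated caption against the objects actually annotated in the image. A minimal sketch of the standard CHAIR-i/CHAIR-s computation, assuming set-valued inputs (the function name and interface are illustrative, not the paper's code):

```python
def chair(caption_objects, ground_truth_objects):
    """Sketch of the CHAIR metric (Rohrbach et al., 2018).

    caption_objects: list of sets, objects mentioned in each caption.
    ground_truth_objects: list of sets, objects annotated in each image.
    Returns (CHAIR_i, CHAIR_s):
      CHAIR_i = hallucinated object mentions / all object mentions
      CHAIR_s = captions with at least one hallucination / all captions
    """
    total_mentions = 0
    hallucinated_mentions = 0
    hallucinated_captions = 0
    for mentioned, truth in zip(caption_objects, ground_truth_objects):
        total_mentions += len(mentioned)
        bad = mentioned - truth  # mentioned but not in the image
        hallucinated_mentions += len(bad)
        if bad:
            hallucinated_captions += 1
    chair_i = hallucinated_mentions / max(total_mentions, 1)
    chair_s = hallucinated_captions / max(len(caption_objects), 1)
    return chair_i, chair_s
```

In practice CHAIR also maps caption words to the 80 COCO object categories (including synonyms) before this comparison; the sketch assumes that mapping has already been applied.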
Dataset Splits | Yes | We select the first 8k data from the LLaVA-Instruct-150k dataset (Li et al., 2023b). We generate captions for 5,000 images randomly sampled from the COCO training dataset and utilize the widely recognized CHAIR hallucination benchmark (Rohrbach et al., 2018) to identify correctly identified and hallucinated objects. For evaluation, we randomly sampled 500 images from the COCO (Lin et al., 2015) validation set and measured object hallucination using the CHAIR metric.
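The data selection described above (first 8k instruction examples, 5,000 random COCO training images for caption generation, 500 random COCO validation images for CHAIR evaluation) can be sketched as follows; the function name, argument names, and fixed seed are our assumptions for reproducibility, not details from the paper:

```python
import random

def select_data(instruct_data, coco_train_ids, coco_val_ids, seed=0):
    """Sketch of the data splits reported for FiSAO training/evaluation."""
    rng = random.Random(seed)  # seed is an assumption; the paper gives none
    sft_subset = instruct_data[:8000]                 # first 8k of LLaVA-Instruct-150k
    caption_imgs = rng.sample(coco_train_ids, 5000)   # COCO train images for captioning
    eval_imgs = rng.sample(coco_val_ids, 500)         # COCO val images for CHAIR eval
    return sft_subset, caption_imgs, eval_imgs
```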
Hardware Specification | Yes | Training is conducted over one epoch, with Proximal Policy Optimization (PPO) applied for four epochs per sample, utilizing four A100 80GB GPUs.
Software Dependencies | No | During the preference tuning process, we adopt Low-Rank Adaptation (LoRA) (Hu et al., 2021) fine-tuning. This statement names a technique (LoRA) but provides no specific software names with version numbers for core dependencies such as the programming language, libraries, or frameworks.
Experiment Setup | Yes | Table 7: Training parameters for the LLaVA-1.5 7B and InstructBLIP 13B models. This table includes specific values for the number of epochs (1), PPO training epochs (4), LoRA r (128), LoRA alpha (256), learning rate (5e-7 for LLaVA-1.5, 4e-6 for InstructBLIP), ξ (0.2), and λ (10).
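The Table 7 hyperparameters can be collected into a plain configuration mapping. The key names below are our assumptions; the values are taken directly from the quoted table:

```python
# Hedged sketch of the Table 7 training hyperparameters as a config dict.
TRAIN_CONFIG = {
    "llava-1.5-7b": {
        "num_epochs": 1,            # one pass over the training data
        "ppo_epochs_per_sample": 4, # PPO applied for four epochs per sample
        "lora_r": 128,              # LoRA rank
        "lora_alpha": 256,          # LoRA scaling factor
        "learning_rate": 5e-7,
        "xi": 0.2,                  # ξ in the paper
        "lambda": 10,               # λ in the paper
    },
    "instructblip-13b": {
        "num_epochs": 1,
        "ppo_epochs_per_sample": 4,
        "lora_r": 128,
        "lora_alpha": 256,
        "learning_rate": 4e-6,      # only the learning rate differs
        "xi": 0.2,
        "lambda": 10,
    },
}
```

Note that LoRA alpha is twice the rank (256 = 2 × 128), a common scaling choice, and that the two models differ only in learning rate.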