Fine-Grained Verifiers: Preference Modeling as Next-token Prediction in Vision-Language Alignment
Authors: Chenhang Cui, An Zhang, Yiyang Zhou, Zhaorun Chen, Gelei Deng, Huaxiu Yao, Tat-Seng Chua
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through both theoretical analysis and experimental validation, we demonstrate that FiSAO effectively addresses the misalignment problem in VLLMs, marking the first instance of token-level rewards being applied to such models. Our code is available at https://github.com/gzcch/FISAO_ICLR. 4 EXPERIMENT In this section, we evaluate FiSAO on the modality alignment of Vision-Language Large Models (VLLMs), showcasing its effectiveness in enhancing model performance. |
| Researcher Affiliation | Academia | 1National University of Singapore, 2UNC-Chapel Hill, 3University of Chicago, 4Nanyang Technological University |
| Pseudocode | Yes | We show the detailed process of FiSAO in Algorithm 1. Algorithm 1 FiSAO |
| Open Source Code | Yes | Our code is available at https://github.com/gzcch/FISAO_ICLR. |
| Open Datasets | Yes | We generate captions for 5,000 images randomly sampled from the COCO training dataset and utilize the widely recognized CHAIR hallucination benchmark (Rohrbach et al., 2018) to identify correctly identified and hallucinated objects. We select the first 8k data from the LLaVA-Instruct 150k dataset (Li et al., 2023b). Evaluation Benchmarks. We conduct evaluations on three types of benchmarks: comprehensive benchmarks, general VQA benchmarks and COCO benchmarks. Specifically, these include: (1) Comprehensive benchmarks (MME (Fu et al., 2024), SEEDBench (Li et al., 2023a), MMBench (Liu et al., 2024c), MM-Vet (Yu et al., 2023b)); (2) VQA (ScienceQA (SQA) (Lu et al., 2022), POPE (Li et al., 2023e), GQA (Hudson & Manning, 2019)); (3) Caption benchmark (Li et al., 2024) (average score of BLEU, ROUGE-L and CIDEr), CHAIR (Rohrbach et al., 2019). |
| Dataset Splits | Yes | We select the first 8k data from the LLaVA-Instruct 150k dataset (Li et al., 2023b). We generate captions for 5,000 images randomly sampled from the COCO training dataset and utilize the widely recognized CHAIR hallucination benchmark (Rohrbach et al., 2018) to identify correctly identified and hallucinated objects. Specifically, we randomly sampled 500 images from the COCO (Lin et al., 2015) validation set and evaluated object hallucination using the CHAIR metric. |
| Hardware Specification | Yes | Training is conducted over one epoch, with Proximal Policy Optimization (PPO) being applied for four epochs per sample, utilizing four A100 80GB GPUs. |
| Software Dependencies | No | During the preference tuning process, we adapt Low-Rank Adaptation (LoRA) (Hu et al., 2021) fine-tuning. This statement mentions a technique (LoRA) but does not name specific software dependencies with version numbers, such as the programming language, libraries, or frameworks used. |
| Experiment Setup | Yes | Table 7: Training parameters for LLaVA-1.5 7B and InstructBLIP 13B models. This table includes specific values for Number of Epochs (1), PPO Training Epochs (4), LoRA r (128), LoRA Alpha (256), Learning Rate (5e-7 for LLaVA-1.5, 4e-6 for InstructBLIP), ξ (0.2), and λ (10). |
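The Table 7 hyperparameters quoted above imply a standard LoRA scaling factor of alpha/r = 256/128 = 2. As a minimal sketch only (the class and field names here are hypothetical, not from the authors' repository, which remains the authoritative source), the reported settings could be collected as:

```python
from dataclasses import dataclass

@dataclass
class FiSAOTrainingConfig:
    """Hypothetical container for the hyperparameters reported in Table 7
    (LLaVA-1.5 7B column; InstructBLIP 13B uses learning_rate=4e-6)."""
    num_epochs: int = 1            # training epochs
    ppo_epochs_per_sample: int = 4 # PPO epochs per sample
    lora_r: int = 128              # LoRA rank
    lora_alpha: int = 256          # LoRA alpha
    learning_rate: float = 5e-7
    xi: float = 0.2                # ξ in the paper
    lam: float = 10.0              # λ in the paper

    @property
    def lora_scaling(self) -> float:
        # In standard LoRA, the low-rank update BA is scaled by alpha/r.
        return self.lora_alpha / self.lora_r


cfg = FiSAOTrainingConfig()
print(cfg.lora_scaling)  # 2.0
```

This reproduces only what the report extracts from Table 7; any values not listed there (e.g. batch size, optimizer) are absent and would need to be taken from the released code.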