Fine-Grained Verifiers: Preference Modeling as Next-token Prediction in Vision-Language Alignment

Authors: Chenhang Cui, An Zhang, Yiyang Zhou, Zhaorun Chen, Gelei Deng, Huaxiu Yao, Tat-Seng Chua

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through both theoretical analysis and experimental validation, we demonstrate that FiSAO effectively addresses the misalignment problem in VLLMs, marking the first instance of token-level rewards being applied to such models. Our code is available at https://github.com/gzcch/FISAO_ICLR. From Section 4 (Experiment): In this section, we evaluate FiSAO on the modality alignment of Vision-Language Large Models (VLLMs), showcasing its effectiveness in enhancing model performance.
Researcher Affiliation | Academia | ¹National University of Singapore, ²UNC-Chapel Hill, ³University of Chicago, ⁴Nanyang Technological University
Pseudocode | Yes | We show the detailed process of FiSAO in Algorithm 1 (Algorithm 1: FiSAO).
Open Source Code | Yes | Our code is available at https://github.com/gzcch/FISAO_ICLR.
Open Datasets | Yes | We generate captions for 5,000 images randomly sampled from the COCO training dataset and utilize the widely recognized CHAIR hallucination benchmark (Rohrbach et al., 2018) to identify correctly identified and hallucinated objects. We select the first 8k data from the LLaVA-Instruct-150k dataset (Li et al., 2023b). Evaluation benchmarks: we conduct evaluations on three types of benchmarks, namely (1) comprehensive benchmarks (MME (Fu et al., 2024), SEED-Bench (Li et al., 2023a), MMBench (Liu et al., 2024c), MM-Vet (Yu et al., 2023b)); (2) VQA benchmarks (ScienceQA (SQA) (Lu et al., 2022), POPE (Li et al., 2023e), GQA (Hudson & Manning, 2019)); and (3) caption benchmarks (Li et al., 2024) (average score of BLEU, ROUGE-L, and CIDEr) and CHAIR (Rohrbach et al., 2019).
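The CHAIR benchmark quoted above scores hallucination by comparing objects mentioned in a generated caption against the objects actually annotated in the image. A minimal sketch of the standard CHAIR-i/CHAIR-s computation, assuming set-valued inputs (the function name and interface are illustrative, not the paper's code):

```python
def chair(caption_objects, ground_truth_objects):
    """Sketch of the CHAIR metric (Rohrbach et al., 2018).

    caption_objects: list of sets, objects mentioned in each caption.
    ground_truth_objects: list of sets, objects annotated in each image.
    Returns (CHAIR_i, CHAIR_s):
      CHAIR_i = hallucinated object mentions / all object mentions
      CHAIR_s = captions with at least one hallucination / all captions
    """
    total_mentions = 0
    hallucinated_mentions = 0
    hallucinated_captions = 0
    for mentioned, truth in zip(caption_objects, ground_truth_objects):
        total_mentions += len(mentioned)
        bad = mentioned - truth  # mentioned but not in the image
        hallucinated_mentions += len(bad)
        if bad:
            hallucinated_captions += 1
    chair_i = hallucinated_mentions / max(total_mentions, 1)
    chair_s = hallucinated_captions / max(len(caption_objects), 1)
    return chair_i, chair_s
```

In practice CHAIR also maps caption words to the 80 COCO object categories (including synonyms) before this comparison; the sketch assumes that mapping has already been applied.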
Dataset Splits | Yes | We select the first 8k data from the LLaVA-Instruct-150k dataset (Li et al., 2023b). We generate captions for 5,000 images randomly sampled from the COCO training dataset and utilize the widely recognized CHAIR hallucination benchmark (Rohrbach et al., 2018) to identify correctly identified and hallucinated objects. For evaluation, we randomly sampled 500 images from the COCO (Lin et al., 2015) validation set and measured object hallucination using the CHAIR metric.
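The data selection described above (first 8k instruction examples, 5,000 random COCO training images for caption generation, 500 random COCO validation images for CHAIR evaluation) can be sketched as follows; the function name, argument names, and fixed seed are our assumptions for reproducibility, not details from the paper:

```python
import random

def select_data(instruct_data, coco_train_ids, coco_val_ids, seed=0):
    """Sketch of the data splits reported for FiSAO training/evaluation."""
    rng = random.Random(seed)  # seed is an assumption; the paper gives none
    sft_subset = instruct_data[:8000]                 # first 8k of LLaVA-Instruct-150k
    caption_imgs = rng.sample(coco_train_ids, 5000)   # COCO train images for captioning
    eval_imgs = rng.sample(coco_val_ids, 500)         # COCO val images for CHAIR eval
    return sft_subset, caption_imgs, eval_imgs
```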
Hardware Specification | Yes | Training is conducted over one epoch, with Proximal Policy Optimization (PPO) applied for four epochs per sample, utilizing four A100 80GB GPUs.
Software Dependencies | No | During the preference tuning process, we adopt Low-Rank Adaptation (LoRA) (Hu et al., 2021) fine-tuning. This statement names a technique (LoRA) but provides no specific software names with version numbers for core dependencies such as the programming language, libraries, or frameworks.
Experiment Setup | Yes | Table 7: Training parameters for the LLaVA-1.5 7B and InstructBLIP 13B models. This table includes specific values for the number of epochs (1), PPO training epochs (4), LoRA r (128), LoRA alpha (256), learning rate (5e-7 for LLaVA-1.5, 4e-6 for InstructBLIP), ξ (0.2), and λ (10).
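The Table 7 hyperparameters can be collected into a plain configuration mapping. The key names below are our assumptions; the values are taken directly from the quoted table:

```python
# Hedged sketch of the Table 7 training hyperparameters as a config dict.
TRAIN_CONFIG = {
    "llava-1.5-7b": {
        "num_epochs": 1,            # one pass over the training data
        "ppo_epochs_per_sample": 4, # PPO applied for four epochs per sample
        "lora_r": 128,              # LoRA rank
        "lora_alpha": 256,          # LoRA scaling factor
        "learning_rate": 5e-7,
        "xi": 0.2,                  # ξ in the paper
        "lambda": 10,               # λ in the paper
    },
    "instructblip-13b": {
        "num_epochs": 1,
        "ppo_epochs_per_sample": 4,
        "lora_r": 128,
        "lora_alpha": 256,
        "learning_rate": 4e-6,      # only the learning rate differs
        "xi": 0.2,
        "lambda": 10,
    },
}
```

Note that LoRA alpha is twice the rank (256 = 2 × 128), a common scaling choice, and that the two models differ only in learning rate.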