FGAIF: Aligning Large Vision-Language Models with Fine-grained AI Feedback

Authors: Liqiang Jing, Xinya Du

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments are conducted on hallucination and general benchmarks, demonstrating the superior performance of our proposed method. Notably, compared with previous models trained with the RL-based aligning method, our proposed method is effective even with fewer parameters.
Researcher Affiliation Academia Liqiang Jing, Xinya Du — Department of Computer Science, The University of Texas at Dallas
Pseudocode Yes Algorithm 1 shows in detail how PPO updates the policy LM Pθ and the value model Vψ with the K fine-grained reward models R_o, R_a, and R_r. Algorithm 1: Fine-Grained Reinforcement Learning from AI Feedback (FGAIF)
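As a rough illustration of the mechanics described above, the sketch below combines K = 3 fine-grained reward signals (object, attribute, relation) into a per-segment scalar reward and applies the standard PPO clipped surrogate. The function names, weighting scheme, and clipping constant are assumptions for illustration, not the paper's actual Algorithm 1.

```python
def combined_reward(segment_rewards, weights=(1.0, 1.0, 1.0)):
    """Combine K fine-grained rewards into one scalar per sub-sentence.

    segment_rewards: list of (r_obj, r_attr, r_rel) tuples, one per segment.
    weights: assumed per-reward-type weights (illustrative only).
    """
    return [sum(w * r for w, r in zip(weights, rs)) for rs in segment_rewards]


def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate for a single segment.

    ratio: new-policy / old-policy probability ratio.
    advantage: advantage estimate from the value model.
    """
    clipped_ratio = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped_ratio * advantage)
```

In an actual PPO loop, `combined_reward` would feed the advantage estimation, and `ppo_clip_objective` would be summed over segments to form the policy loss.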
Open Source Code Yes We released our code via https://github.com/LiqiangJing/FGAIF.
Open Datasets Yes We sample 3,500 and 14,000 examples from the MSCOCO 2014 (Lin et al., 2014) training set for reward model training and LVLM training, respectively. To assess the sensitivity of our approach to different object types, we followed prior works (Jiang et al., 2024; Yan et al., 2024) and constructed a dedicated out-of-distribution test set based on the Foggy dataset (Cordts et al., 2016).
Dataset Splits Yes We sample 3,500 and 14,000 examples from the MSCOCO 2014 (Lin et al., 2014) training set for reward model training and LVLM training, respectively. The original test set (comprising 500 samples) covered a sub-segment count range of [6, 15]. To assess model performance on longer responses, we collected an additional 200 samples with sub-segment counts between [15, 20].
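The sampling described above can be sketched as drawing two subsets from the MSCOCO 2014 training images. The function name, seed, and the assumption that the two subsets are disjoint are ours; the paper only reports the 3,500 / 14,000 sizes.

```python
import random


def split_coco_train(image_ids, n_reward=3500, n_lvlm=14000, seed=0):
    """Sample disjoint reward-model and LVLM training subsets.

    image_ids: list of MSCOCO 2014 training image identifiers.
    Returns (reward_model_ids, lvlm_ids); disjointness is an assumption.
    """
    rng = random.Random(seed)
    sampled = rng.sample(image_ids, n_reward + n_lvlm)
    return sampled[:n_reward], sampled[n_reward:]
```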
Hardware Specification Yes All experiments are conducted on a server with 4 A100 80GB GPUs.
Software Dependencies Yes The version of our ChatGPT is gpt-3.5-turbo-0125. Furthermore, our FGAIF brings more performance gain on the Conv subset, Detail subset, Complex subset, and the full set, compared with LLaVA-RLHF-7B. This further indicates the advantage of our method. We replaced gpt-3.5-turbo-0125 with gpt-4o-2024-08-06 in our method and found that the performance remained consistent, achieving an F1 score of 83.5 on the POPE dataset.
Experiment Setup Yes For the reward model training, we use the Adam optimizer, and the learning rate, batch size, and number of epochs are set to 2e-5, 4, and 100, respectively. For the PPO training, we use the Adam optimizer, and the learning rate, batch size, and number of epochs are set to 1e-7, 256, and 2, respectively. We sample 3,500 and 14,000 examples from the MSCOCO 2014 (Lin et al., 2014) training set for reward model training and LVLM training, respectively. The prompt is set to "Describe this image in detail." for model training and sampling. We adopt LoRA (Hu et al., 2022a) for all the reward model training and the LVLM fine-tuning processes.
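The hyperparameters reported above can be summarized as configuration dictionaries. The optimizer, learning rates, batch sizes, epoch counts, prompt, and use of LoRA come from the text; the key names and dict layout are assumptions for illustration.

```python
# Reported training configuration, collected from the experiment-setup text.
REWARD_MODEL_CFG = {
    "optimizer": "adam",
    "lr": 2e-5,
    "batch_size": 4,
    "epochs": 100,
    "peft": "lora",  # LoRA (Hu et al., 2022a)
}

PPO_CFG = {
    "optimizer": "adam",
    "lr": 1e-7,
    "batch_size": 256,
    "epochs": 2,
    "peft": "lora",
}

# Prompt used for both model training and sampling.
PROMPT = "Describe this image in detail."
```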