FGAIF: Aligning Large Vision-Language Models with Fine-grained AI Feedback
Authors: Liqiang Jing, Xinya Du
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments are conducted on hallucination and general benchmarks, demonstrating the superior performance of our proposed method. Notably, compared with previous models trained with the RL-based aligning method, our proposed method is effective even with fewer parameters. |
| Researcher Affiliation | Academia | Liqiang Jing Xinya Du Department of Computer Science, The University of Texas at Dallas EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1 shows in detail how PPO updates the policy LM Pθ and the value model Vψ with K fine-grained reward models Ro/a/r. Algorithm 1: Fine-Grained Reinforcement Learning from AI Feedback (FGAIF) |
| Open Source Code | Yes | We released our code via https://github.com/LiqiangJing/FGAIF. |
| Open Datasets | Yes | We sample 3,500 and 14,000 examples from the MSCOCO 2014 (Lin et al., 2014) training set for reward model training and LVLM training, respectively. To assess the sensitivity of our approach to different object types, we followed prior works (Jiang et al., 2024; Yan et al., 2024) and constructed a dedicated out-of-distribution test set based on the Foggy dataset (Cordts et al., 2016). |
| Dataset Splits | Yes | We sample 3,500 and 14,000 examples from the MSCOCO 2014 (Lin et al., 2014) training set for reward model training and LVLM training, respectively. The original test set (comprising 500 samples) covered a range of [6, 15]. To assess model performance on longer responses, we collected an additional 200 samples with sub-segment counts between [15, 20]. |
| Hardware Specification | Yes | All experiments are conducted on a 4 A100 80G GPU Server. |
| Software Dependencies | Yes | The version of our ChatGPT is gpt-3.5-turbo-0125. Furthermore, our FGAIF brings more performance gain on the Conv subset, Detail subset, Complex subset, and full set, compared with LLaVA-RLHF-7B. This further indicates the advantage of our method. We replaced gpt-3.5-turbo-0125 with gpt-4o-2024-08-06 in our method and found that the performance remained consistent, achieving an F1 score of 83.5 on the POPE dataset. |
| Experiment Setup | Yes | For the reward model training, we use the Adam optimizer, and the learning rate, batch size, and number of epochs are set to 2e-5, 4, and 100, respectively. For the PPO training, we use the Adam optimizer, and the learning rate, batch size, and number of epochs are set to 1e-7, 256, and 2, respectively. We sample 3,500 and 14,000 examples from the MSCOCO 2014 (Lin et al., 2014) training set for reward model training and LVLM training, respectively. The prompt is set to "Describe this image in detail." for both model training and sampling. We adopt LoRA (Hu et al., 2022a) for all the reward model training and the LVLM fine-tuning processes. |
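The pseudocode row above describes PPO guided by K fine-grained reward models Ro/a/r (object, attribute, relation). The sketch below is illustrative only, not the authors' code: it assumes each sub-sentence of a response receives a ±1 faithfulness label per feedback type, combines them into a scalar reward per segment, and places that reward on the segment's last token (a common convention in fine-grained RL, assumed here). The function names and weights are hypothetical.

```python
# Illustrative sketch (assumption, not the paper's implementation):
# turn K=3 segment-level AI-feedback labels into per-token rewards.

def segment_rewards(feedback, weights=(1.0, 1.0, 1.0)):
    """feedback: one dict per sub-sentence mapping 'object'/'attribute'/
    'relation' to +1 (faithful) or -1 (hallucinated).
    Returns one scalar reward per sub-sentence."""
    kinds = ("object", "attribute", "relation")
    return [sum(w * seg[k] for w, k in zip(weights, kinds)) for seg in feedback]

def token_level_rewards(segments, feedback):
    """Assign each sub-sentence's reward to its final token, zero elsewhere
    (an assumed convention for credit assignment in PPO)."""
    rewards = []
    for seg_tokens, r in zip(segments, segment_rewards(feedback)):
        rewards.extend([0.0] * (len(seg_tokens) - 1) + [r])
    return rewards

segments = [["a", "red", "car"], ["parked", "on", "grass"]]
feedback = [{"object": 1, "attribute": 1, "relation": 1},
            {"object": 1, "attribute": -1, "relation": 1}]
print(token_level_rewards(segments, feedback))  # [0.0, 0.0, 3.0, 0.0, 0.0, 1.0]
```

The token-level vector can then be fed to a PPO update in place of a single sequence-level reward, which is the core idea the algorithm row describes.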
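The setup row notes that LoRA (Hu et al., 2022a) is used for all reward-model and LVLM fine-tuning. A back-of-the-envelope calculation shows why this keeps training cheap on the stated 4×A100 budget: a rank-r update W + BA to a d×k weight trains only r·(d + k) parameters instead of d·k. The shapes and rank below are illustrative assumptions, not values from the paper.

```python
# Parameter-count sketch for a LoRA update (illustrative shapes, not
# taken from the paper): rank-r factors B (d x r) and A (r x k)
# replace a full d x k weight update.

def lora_params(d, k, r):
    return r * (d + k)

d = k = 4096   # a typical transformer projection size (assumption)
r = 16         # a commonly used LoRA rank (assumption)
full = d * k
lora = lora_params(d, k, r)
print(full, lora, round(lora / full * 100, 2))  # 16777216 131072 0.78
```

At these assumed shapes, the trainable fraction is under 1% of the full matrix, which is consistent with LoRA being practical for both the reward models and the LVLM.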