FGAIF: Aligning Large Vision-Language Models with Fine-grained AI Feedback

Authors: Liqiang Jing, Xinya Du

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments are conducted on hallucination and general benchmarks, demonstrating the superior performance of our proposed method. Notably, compared with previous models trained with the RL-based aligning method, our proposed method is effective even with fewer parameters.
Researcher Affiliation Academia Liqiang Jing, Xinya Du — Department of Computer Science, The University of Texas at Dallas
Pseudocode Yes Algorithm 1 shows in detail how PPO updates the policy LM Pθ and the value model Vψ with the K fine-grained reward models R_o, R_a, and R_r. Algorithm 1: Fine-Grained Reinforcement Learning from AI Feedback (FGAIF)
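As a rough illustration of the mechanics described above, the sketch below combines K = 3 fine-grained reward signals (object, attribute, relation) into a per-segment scalar reward and applies the standard PPO clipped surrogate. The function names, weighting scheme, and clipping constant are assumptions for illustration, not the paper's actual Algorithm 1.

```python
def combined_reward(segment_rewards, weights=(1.0, 1.0, 1.0)):
    """Combine K fine-grained rewards into one scalar per sub-sentence.

    segment_rewards: list of (r_obj, r_attr, r_rel) tuples, one per segment.
    weights: assumed per-reward-type weights (illustrative only).
    """
    return [sum(w * r for w, r in zip(weights, rs)) for rs in segment_rewards]


def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate for a single segment.

    ratio: new-policy / old-policy probability ratio.
    advantage: advantage estimate from the value model.
    """
    clipped_ratio = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped_ratio * advantage)
```

In an actual PPO loop, `combined_reward` would feed the advantage estimation, and `ppo_clip_objective` would be summed over segments to form the policy loss.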
Open Source Code Yes We released our code via https://github.com/LiqiangJing/FGAIF.
Open Datasets Yes We sample 3,500 and 14,000 examples from the MSCOCO 2014 (Lin et al., 2014) training set for reward model training and LVLM training, respectively. To assess the sensitivity of our approach to different object types, we followed prior works (Jiang et al., 2024; Yan et al., 2024) and constructed a dedicated out-of-distribution test set based on the Foggy dataset (Cordts et al., 2016).
Dataset Splits Yes We sample 3,500 and 14,000 examples from the MSCOCO 2014 (Lin et al., 2014) training set for reward model training and LVLM training, respectively. The original test set (comprising 500 samples) covered a sub-segment count range of [6, 15]. To assess model performance on longer responses, we collected an additional 200 samples with sub-segment counts between [15, 20].
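The sampling described above can be sketched as drawing two subsets from the MSCOCO 2014 training images. The function name, seed, and the assumption that the two subsets are disjoint are ours; the paper only reports the 3,500 / 14,000 sizes.

```python
import random


def split_coco_train(image_ids, n_reward=3500, n_lvlm=14000, seed=0):
    """Sample disjoint reward-model and LVLM training subsets.

    image_ids: list of MSCOCO 2014 training image identifiers.
    Returns (reward_model_ids, lvlm_ids); disjointness is an assumption.
    """
    rng = random.Random(seed)
    sampled = rng.sample(image_ids, n_reward + n_lvlm)
    return sampled[:n_reward], sampled[n_reward:]
```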
Hardware Specification Yes All experiments are conducted on a server with 4 A100 80GB GPUs.
Software Dependencies Yes The version of our ChatGPT is gpt-3.5-turbo-0125. Furthermore, our FGAIF brings more performance gain on the Conv subset, Detail subset, Complex subset, and the full set, compared with LLaVA-RLHF-7B. This further indicates the advantage of our method. We replaced gpt-3.5-turbo-0125 with gpt-4o-2024-08-06 in our method and found that the performance remained consistent, achieving an F1 score of 83.5 on the POPE dataset.
Experiment Setup Yes For the reward model training, we use the Adam optimizer, and the learning rate, batch size, and number of epochs are set to 2e-5, 4, and 100, respectively. For the PPO training, we use the Adam optimizer, and the learning rate, batch size, and number of epochs are set to 1e-7, 256, and 2, respectively. We sample 3,500 and 14,000 examples from the MSCOCO 2014 (Lin et al., 2014) training set for reward model training and LVLM training, respectively. The prompt is set to "Describe this image in detail." for model training and sampling. We adopt LoRA (Hu et al., 2022a) for all the reward model training and the LVLM fine-tuning processes.
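The hyperparameters reported above can be summarized as configuration dictionaries. The optimizer, learning rates, batch sizes, epoch counts, prompt, and use of LoRA come from the text; the key names and dict layout are assumptions for illustration.

```python
# Reported training configuration, collected from the experiment-setup text.
REWARD_MODEL_CFG = {
    "optimizer": "adam",
    "lr": 2e-5,
    "batch_size": 4,
    "epochs": 100,
    "peft": "lora",  # LoRA (Hu et al., 2022a)
}

PPO_CFG = {
    "optimizer": "adam",
    "lr": 1e-7,
    "batch_size": 256,
    "epochs": 2,
    "peft": "lora",
}

# Prompt used for both model training and sampling.
PROMPT = "Describe this image in detail."
```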