RoVRM: A Robust Visual Reward Model Optimized via Auxiliary Textual Preference Data
Authors: Chenglong Wang, Yang Gan, Yifu Huo, Yongyu Mu, Murun Yang, Qiaozhi He, Tong Xiao, Chunliang Zhang, Tongran Liu, Jingbo Zhu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experiment with RoVRM on the commonly used vision-language tasks based on the LLaVA-1.5-7B and -13B models. Experimental results demonstrate that RoVRM consistently outperforms traditional VRMs. Furthermore, our three-phase progressive training and preference data selection approaches can yield consistent performance gains over ranking-based alignment techniques, such as direct preference optimization. Through experiments on commonly used vision-language tasks, we aim to evaluate RoVRM using two human-preference alignment techniques: best-of-n sampling and RL. Our results demonstrate improved performance in each task when aligned with reward signals from RoVRM. |
| Researcher Affiliation | Collaboration | 1 School of Computer Science and Engineering, Northeastern University, Shenyang, China 2 NiuTrans Research, Shenyang, China 3 CAS Key Laboratory of Behavioral Science, Institute of Psychology, CAS, Beijing, China |
| Pseudocode | No | The paper describes methods and processes through text and diagrams (Figure 1), but does not include any explicitly labeled pseudocode or algorithm blocks with structured, code-like formatting. |
| Open Source Code | Yes | Our code is publicly available*. *https://github.com/NiuTrans/Vision-LLM-Alignment |
| Open Datasets | Yes | Textual Preference Dataset: We used UltraFeedback (Cui et al. 2023)... Image Caption-based Preference Dataset: ...when the image is present in the COCO caption dataset, we used the human-annotated captions directly. Visual Preference Dataset: We employed the visual preference dataset from RLAIF-V (Yu et al. 2024b)... RL Training: We sampled 50k instructions from LLaVA-Instruct-150K (Liu et al. 2024b) for training. https://huggingface.co/datasets/lmms-lab/COCOCaption2017 |
| Dataset Splits | No | The paper mentions sampling specific numbers of instructions for training (e.g., "We sampled 50k instructions from LLaVA-Instruct-150K... for training") and using certain samples for warming up the VRM (e.g., "5k samples to warm up the VRM, consisting of 2k samples from the dataset to be selected and 3k samples from the target preference dataset"). However, it does not explicitly provide the train/test/validation splits for the primary datasets used in model evaluation or training, nor does it refer to specific predefined splits for reproduction. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. It mentions using LLaVA-1.5-7B and LLaVA-1.5-13B models but not the hardware they were run on. |
| Software Dependencies | No | Our implementation of optimal transport solvers is done using Python Optimal Transport (POT). While this mentions a software library (POT), it does not specify a version number, which is required for reproducibility. |
| Experiment Setup | Yes | For training RoVRM, we used the LLaVA-1.5-7B model to initialize the visual reward model. The learning rates for the three-phase progressive training were set to 2e-5 for phase one, and 1e-6 for phases two and three. For optimal transport-based preference data selection, we used 5k samples to warm up the VRM, consisting of 2k samples from the dataset to be selected and 3k samples from the target preference dataset. The representative subset size was set to 5k samples. For best-of-n sampling and RL training, we employed the LLaVA-1.5-7B as the initial model. In the process of best-of-n sampling, we set the sampling size to 8. |
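The best-of-n sampling procedure described in the setup (sampling size 8, candidates scored by the reward model) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `sample` and `score` are hypothetical stand-ins for the policy model's sampler and the RoVRM scorer.

```python
def best_of_n(prompt, sample, score, n=8):
    """Draw n candidate responses for `prompt` and return the one the
    reward model scores highest (best-of-n sampling)."""
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins for illustration only: a deterministic "sampler" that
# cycles through canned responses, and a length-based "reward model".
import itertools
responses = itertools.cycle(["ok", "a longer answer", "mid reply"])
pick = best_of_n("Q", lambda p: next(responses), len, n=3)
print(pick)  # prints "a longer answer", the highest-scoring candidate
```

In practice the sampler would decode from the aligned LLaVA-1.5 policy with temperature sampling, and the scorer would be a forward pass through the trained visual reward model.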