RoVRM: A Robust Visual Reward Model Optimized via Auxiliary Textual Preference Data
Authors: Chenglong Wang, Yang Gan, Yifu Huo, Yongyu Mu, Murun Yang, Qiaozhi He, Tong Xiao, Chunliang Zhang, Tongran Liu, Jingbo Zhu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experiment with RoVRM on the commonly used vision-language tasks based on the LLaVA-1.5-7B and -13B models. Experimental results demonstrate that RoVRM consistently outperforms traditional VRMs. Furthermore, our three-phase progressive training and preference data selection approaches can yield consistent performance gains over ranking-based alignment techniques, such as direct preference optimization. Through experiments on commonly used vision-language tasks, we aim to evaluate RoVRM using two human-preference alignment techniques: best-of-n sampling and RL. Our results demonstrate improved performance in each task when aligned with reward signals from RoVRM. |
| Researcher Affiliation | Collaboration | 1 School of Computer Science and Engineering, Northeastern University, Shenyang, China 2 NiuTrans Research, Shenyang, China 3 CAS Key Laboratory of Behavioral Science, Institute of Psychology, CAS, Beijing, China |
| Pseudocode | No | The paper describes methods and processes through text and diagrams (Figure 1), but does not include any explicitly labeled pseudocode or algorithm blocks with structured, code-like formatting. |
| Open Source Code | Yes | Our code is publicly available*. *https://github.com/NiuTrans/Vision-LLM-Alignment |
| Open Datasets | Yes | Textual Preference Dataset: We used UltraFeedback (Cui et al. 2023)... Image Caption-based Preference Dataset: ...when the image is present in the COCO caption dataset, we used the human-annotated captions directly. Visual Preference Dataset: We employed the visual preference dataset from RLAIF-V (Yu et al. 2024b)... RL Training: We sampled 50k instructions from LLaVA-Instruct-150K (Liu et al. 2024b) for training. https://huggingface.co/datasets/lmms-lab/COCOCaption2017 |
| Dataset Splits | No | The paper mentions sampling specific numbers of instructions for training (e.g., "We sampled 50k instructions from LLaVA-Instruct-150K... for training") and using certain samples for warming up the VRM (e.g., "5k samples to warm up the VRM, consisting of 2k samples from the dataset to be selected and 3k samples from the target preference dataset"). However, it does not explicitly provide the train/test/validation splits for the primary datasets used in model evaluation or training, nor does it refer to specific predefined splits for reproduction. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. It mentions using LLaVA-1.5-7B and LLaVA-1.5-13B models but not the hardware they were run on. |
| Software Dependencies | No | Our implementation of optimal transport solvers is done using Python Optimal Transport (POT). While this mentions a software library (POT), it does not specify a version number, which is required for reproducibility. |
| Experiment Setup | Yes | For training RoVRM, we used the LLaVA-1.5-7B model to initialize the visual reward model. The learning rates for the three-phase progressive training were set to 2e-5 for phase one, and 1e-6 for phases two and three. For optimal transport-based preference data selection, we used 5k samples to warm up the VRM, consisting of 2k samples from the dataset to be selected and 3k samples from the target preference dataset. The representative subset size was set to 5k samples. For best-of-n sampling and RL training, we employed the LLaVA-1.5-7B as the initial model. In the process of best-of-n sampling, we set the sampling size to 8. |
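The best-of-n sampling procedure described in the setup (sampling size 8, candidates scored by the reward model) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `sample` and `score` are hypothetical stand-ins for the policy model's sampler and the RoVRM scorer.

```python
def best_of_n(prompt, sample, score, n=8):
    """Draw n candidate responses for `prompt` and return the one the
    reward model scores highest (best-of-n sampling)."""
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins for illustration only: a deterministic "sampler" that
# cycles through canned responses, and a length-based "reward model".
import itertools
responses = itertools.cycle(["ok", "a longer answer", "mid reply"])
pick = best_of_n("Q", lambda p: next(responses), len, n=3)
print(pick)  # prints "a longer answer", the highest-scoring candidate
```

In practice the sampler would decode from the aligned LLaVA-1.5 policy with temperature sampling, and the scorer would be a forward pass through the trained visual reward model.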