MVReward: Better Aligning and Evaluating Multi-View Diffusion Models with Human Preferences

Authors: Weitao Wang, Haoran Xu, Yuxiao Yang, Zhifang Liu, Jun Meng, Haoqian Wang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that MVReward can serve as a reliable metric and MVP consistently enhances the alignment of multi-view diffusion models with human preferences. We conduct a user study to evaluate MVReward's ability in predicting human preferences. We perform ablation studies on the encoder backbone, multi-view self-attention, and negative samples to assess their effects on MVReward.
Researcher Affiliation | Academia | (1) Tsinghua Shenzhen International Graduate School, Tsinghua University; (2) Zhejiang University
Pseudocode | Yes | Algorithm 1: Multi-View Preference Learning (MVP) for Multi-View DMs
Open Source Code | Yes | Code: https://github.com/victor-thu/MVReward
Open Datasets | Yes | We begin by generating and filtering a standardized image prompt set from DALL-E (Ramesh et al. 2021) and Objaverse (Deitke et al. 2023), ensuring the object(s) in each image are fully visible with well-designed geometry and texture. Furthermore, taking the widely-used GSO dataset (Downs et al. 2022) as an example…
Dataset Splits | Yes | The training, validation and test datasets are split according to an 8:1:1 ratio.
Hardware Specification | Yes | Optimal performance is achieved with a batch size of 96 in total, an initial learning rate of 4e-5 using cosine annealing, on 4 NVIDIA Quadro RTX 8000. Both models are fine-tuned in half-precision on 8 NVIDIA Quadro RTX 8000, with a batch size of 128 in total and a learning rate of 5e-6 with warm-up.
Software Dependencies | No | The paper mentions BLIP and ViT-B as pre-trained models but does not specify version numbers for any software libraries or dependencies used in the implementation.
Experiment Setup | Yes | Optimal performance is achieved with a batch size of 96 in total, an initial learning rate of 4e-5 using cosine annealing, on 4 NVIDIA Quadro RTX 8000. Both models are fine-tuned in half-precision on 8 NVIDIA Quadro RTX 8000, with a batch size of 128 in total and a learning rate of 5e-6 with warm-up. The model parameters are fixed except for the designated trainable modules within the UNet.
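The 8:1:1 train/validation/test split reported above can be sketched as a shuffled partition. This is a minimal illustration, not the authors' code; the function name, seed, and shuffling strategy are assumptions:

```python
import random

def split_dataset(items, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle items deterministically, then partition them 8:1:1
    into train/validation/test subsets (assumed implementation)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))  # 800 100 100
```

Fixing the seed keeps the split reproducible across runs, which matters when the same partition must be reused for ablations.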
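The reported learning-rate settings (cosine annealing at 4e-5 for reward-model training, 5e-6 with warm-up for fine-tuning) can be sketched as one schedule combining linear warm-up with cosine decay. This is an illustrative reconstruction, not the paper's implementation; the warm-up length and the pairing of warm-up with cosine decay are assumptions:

```python
import math

def lr_at_step(step, total_steps, base_lr=4e-5, warmup_steps=0):
    """Learning rate at a given step: optional linear warm-up,
    then cosine annealing from base_lr down toward zero.

    base_lr=4e-5 matches the reward-model setting; the fine-tuning
    stage would use base_lr=5e-6 with warmup_steps > 0 (length assumed).
    """
    if warmup_steps and step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps  # linear warm-up
    # cosine annealing over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))

# Fine-tuning-style schedule: ramps up over 500 steps, then decays.
print(lr_at_step(250, 10_000, base_lr=5e-6, warmup_steps=500))
```

In practice this would drive an optimizer's learning rate each step (e.g., via a PyTorch `LambdaLR`); the closed-form function above just makes the shape of the schedule explicit.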