MVReward: Better Aligning and Evaluating Multi-View Diffusion Models with Human Preferences
Authors: Weitao Wang, Haoran Xu, Yuxiao Yang, Zhifang Liu, Jun Meng, Haoqian Wang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that MVReward can serve as a reliable metric and MVP consistently enhances the alignment of multi-view diffusion models with human preferences. We conduct a user study to evaluate MVReward's ability to predict human preferences. We perform ablation studies on the encoder backbone, multi-view self-attention, and negative samples to assess their effects on MVReward. |
| Researcher Affiliation | Academia | 1Tsinghua Shenzhen International Graduate School, Tsinghua University 2Zhejiang University |
| Pseudocode | Yes | Algorithm 1: Multi-View Preference Learning (MVP) for Multi-View DMs |
| Open Source Code | Yes | Code https://github.com/victor-thu/MVReward |
| Open Datasets | Yes | We begin by generating and filtering a standardized image prompt set from DALL-E (Ramesh et al. 2021) and Objaverse (Deitke et al. 2023), ensuring the object(s) in each image are fully visible with well-designed geometry and texture. Furthermore, taking the widely-used GSO dataset (Downs et al. 2022) as an example |
| Dataset Splits | Yes | The training, validation and test datasets are split according to an 8:1:1 ratio. |
| Hardware Specification | Yes | Optimal performance is achieved with a batch size of 96 in total, an initial learning rate of 4e-5 using cosine annealing, on 4 NVIDIA Quadro RTX 8000. Both models are fine-tuned in half-precision on 8 NVIDIA Quadro RTX 8000, with a batch size of 128 in total and a learning rate of 5e-6 with warm-up. |
| Software Dependencies | No | The paper mentions BLIP and ViT-B as pre-trained models but does not specify version numbers for any software libraries or dependencies used in the implementation. |
| Experiment Setup | Yes | Optimal performance is achieved with a batch size of 96 in total, an initial learning rate of 4e-5 using cosine annealing, on 4 NVIDIA Quadro RTX 8000. Both models are fine-tuned in half-precision on 8 NVIDIA Quadro RTX 8000, with a batch size of 128 in total and a learning rate of 5e-6 with warm-up. The model parameters are fixed except for the designated trainable modules within the UNet. |
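The quoted setup (an 8:1:1 train/validation/test split and a 4e-5 initial learning rate with cosine annealing) can be sketched as below. This is a minimal illustration, not the authors' code; the function names and the seed are assumptions.

```python
import math
import random

def split_dataset(items, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle items and split them into train/val/test by the given ratios."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

def cosine_annealed_lr(step, total_steps, base_lr=4e-5, min_lr=0.0):
    """Cosine-annealed learning rate decaying from base_lr toward min_lr."""
    progress = step / max(1, total_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))  # 800 100 100
```

The schedule starts at `base_lr` at step 0 and decays smoothly to `min_lr` at the final step; the paper additionally uses warm-up for the fine-tuning stage, which is omitted here.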