Multi-Reference Preference Optimization for Large Language Models

Authors: Hung Le, Quan Hung Tran, Dung Nguyen, Kien Do, Saloni Mittal, Kelechi Ogueji, Svetha Venkatesh

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate that LLMs finetuned with MRPO generalize better on various preference data, regardless of data scarcity or abundance. Furthermore, MRPO effectively finetunes LLMs to exhibit superior performance in several downstream natural language processing tasks such as HH-RLHF, GSM8K, and TruthfulQA. In preference learning tasks involving 6 preference datasets, MRPO demonstrates significant superiority over DPO and multi-reference baselines, especially when preference data is limited, with improvements of up to 7%. In terms of helpfulness evaluation, MRPO significantly outperforms DPO by 13.7%.
Researcher Affiliation | Collaboration | Hung Le1, Quan Hung Tran2, Dung Nguyen1, Kien Do1, Saloni Mittal2, Kelechi Ogueji2, Svetha Venkatesh1. 1Applied AI Institute, Deakin University, Geelong, Australia; 2ServiceNow Research, USA.
Pseudocode | No | The paper describes methods like 'Multi-Reference Preference Optimization', 'Clipped Trust-Regions Optimization (CTRO)', and 'Adaptive Reference Weighting Coefficients (ARWC)' with mathematical formulations, but it does not include a distinct block explicitly labeled as 'Pseudocode' or 'Algorithm'.
Open Source Code | Yes | The source code can be accessed at: https://github.com/thaihungle/MRPO.
Open Datasets | Yes | Each training dataset comprises a random subset from a larger public preference dataset available in the Hugging Face data repository. We utilize 3 big preference datasets: HelpSteer, Ultrafeedback, and Nectar (see Appendix Table 7). These models are evaluated on 2 datasets: the helpful and harmless HH-RLHF (Bai et al. 2022) and Alpaca-Eval 2.0 (Li et al. 2023). For our evaluation benchmark, we utilized the Hugging Face Open LLM Leaderboard, a standard in the field (Beeching et al. 2023).
Dataset Splits | Yes | Each training dataset comprises a random subset from a larger public preference dataset available in the Hugging Face data repository. The remaining portions of the dataset are utilized as testing data. We use the provided train/test split for HelpSteer and Ultrafeedback. For Nectar, we randomly allocate 90% of the data for training and 10% for testing.
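The Nectar 90/10 allocation described above amounts to a seeded random partition. A minimal sketch (function and field names are illustrative, not taken from the authors' code):

```python
import random

def split_nectar(examples, train_frac=0.9, seed=42):
    """Randomly allocate train_frac of the examples for training
    and the remainder for testing, as described for Nectar."""
    rng = random.Random(seed)
    shuffled = examples[:]          # copy so the input list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

train_set, test_set = split_nectar(list(range(1000)))
# 900 training examples, 100 test examples
```

Fixing the seed makes the split reproducible across runs, which matters when comparing MRPO against the DPO baselines on the same held-out data.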
Hardware Specification | Yes | Unless specified otherwise, we finetune these LLMs using LoRA with 4-bit quantization to enable faster training and accommodate our hardware of a single Tesla A100 GPU with 32GB of memory.
Software Dependencies | No | The paper mentions various models and libraries (e.g., Llama, Mistral, Qwen, LoRA, Hugging Face Open LLM Leaderboard, Language Model Evaluation Harness library), but it does not specify version numbers for general software dependencies such as Python, PyTorch, or the Hugging Face Transformers library.
Experiment Setup | Yes | Unless stated otherwise, all baselines use the same common hyperparameters, such as learning rate (1e-5), batch size (8), number of epochs (3), and β = 0.1. For Multi-DPO, we have to use clipped trust regions to ensure Ref M2 is close to Ref M1; otherwise, the learning will not converge. To make a fair comparison, MRPO, Multi-DPO, and KD all use ϵ_max = 0.1 and incorporate the adaptive ϵ.
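The shared setup above can be written down concretely. The sketch below records the reported hyperparameters in a plain dictionary and shows how a clipped trust region might bound a reference probability ratio to ϵ_max; key names and the `clip_ratio` helper are illustrative assumptions, not the paper's CTRO implementation:

```python
# Shared hyperparameters reported for all baselines (illustrative key names).
COMMON_HPARAMS = {
    "learning_rate": 1e-5,
    "batch_size": 8,
    "num_epochs": 3,
    "beta": 0.1,        # DPO temperature
    "eps_max": 0.1,     # clipped trust-region bound
    "adaptive_eps": True,
}

def clip_ratio(ratio, eps):
    """Clip a reference probability ratio into [1 - eps, 1 + eps],
    keeping auxiliary references close to the main reference."""
    return max(1.0 - eps, min(1.0 + eps, ratio))
```

For example, with ϵ = 0.1 a ratio of 1.5 would be clipped down to 1.1, while a ratio already inside the trust region passes through unchanged.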