RainbowPO: A Unified Framework for Combining Improvements in Preference Optimization
Authors: Hanyang Zhao, Genta Winata, Anirban Das, Shi-Xiong Zhang, David Yao, Wenpin Tang, Sambit Sahu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments, we demonstrate that RAINBOWPO outperforms existing DPO variants. Additionally, we provide insights to guide researchers in developing new DPO methods and assist practitioners in their implementations. Our code is available at https://github.com/CapitalOne-Research/RainbowPO. |
| Researcher Affiliation | Collaboration | Hanyang Zhao¹, Genta Indra Winata², Anirban Das², Shi-Xiong Zhang², David D. Yao¹, Wenpin Tang¹, Sambit Sahu² (¹Columbia University, ²Capital One) |
| Pseudocode | Yes | Algorithm 1 RS+ for preferences formulation. |
| Open Source Code | Yes | Our code is available at https://github.com/CapitalOne-Research/RainbowPO. |
| Open Datasets | Yes | As the evaluation benchmark, we use the widely adopted AlpacaEval 2, which is composed of 805 questions and evaluates the instruction-following capability of the model. AlpacaEval 2 scores models by the win rate (WR) of model generations against reference/base answers generated by GPT-4-turbo. The comparisons are by default annotated by GPT-4-turbo, and the resulting WR has a 68.5% consistency with human evaluation, according to the official AlpacaEval 2 website. The Length-Controlled (LC) Win Rate is a debiased version of the WR that controls for output length, increasing the WR's correlation with Chat Arena while significantly decreasing the length gameability of the annotator. To cross-validate the effectiveness of the model and mitigate possible GPT-4 bias, we also adopt Llama3-70B-Instruct as the judge, which is reported to have a 67.5% win-rate consistency with humans (close to the performance of GPT-4). We include more detailed background information on AlpacaEval 2 in Appendix E.1. For formulating the preference dataset D, we follow the standard RLHF pipeline by directly generating answers from the model (thus an on-policy dataset, though the algorithm is still offline) and obtaining AI feedback as in SimPO (Meng et al., 2024): we generate 5 answers from Llama3-8B-Instruct for each prompt in UltraFeedback (Cui et al., 2023), rank them with scores evaluated by ArmoRM (Wang et al., 2024a), and choose the best/worst one as the winning/losing answer to form the preference pairs. |
| Dataset Splits | No | The paper describes how the preference dataset is formed (generating answers and ranking them for each prompt in Ultra Feedback) and mentions using Alpaca Eval2 as a benchmark for evaluation, which consists of 805 questions. However, it does not specify explicit training/validation/test splits (e.g., percentages or sample counts) for the custom-formed preference dataset or how the Alpaca Eval2 benchmark is used in terms of data splits for their experiments. |
| Hardware Specification | No | Due to constraints in computing resources and time, we defer this investigation to future work. |
| Software Dependencies | No | For training, we adopted the popular library Transformer Reinforcement Learning (TRL),4 which already implements most of the aforementioned xPO algorithms and keeps everything under the same backend, making it easy to reproduce. |
| Experiment Setup | Yes | For training, we adopted the popular library Transformer Reinforcement Learning (TRL),4 which already implements most of the aforementioned xPO algorithms and keeps everything under the same backend, making it easy to reproduce. If not specified, we train the model for 3 epochs, which typically yields better performance for each xPO according to our replication. ... Hyper-parameters. Like all other xPOs, RainbowPO can require an extensive hyper-parameter search to achieve the best performance over f, α, β, γ and whether η = 1. For efficient hyper-parameter search, we conducted a greedy search with the help of our framework and its decomposition into effective elements: as we gradually add designs to the preference optimization methods, we search only over the hyper-parameters that affect performance the most. For example, when adding length normalization to the methods, we only search for the best value of the regularization parameter β, fixing the learning rate and all other training arguments, which prevents the search space from exploding. ... We adopt the default template for evaluators provided by AlpacaEval, and we adopt the following fixed generation config for each model: max new tokens 4096, temperature 0.7, and top-p 0.1. |
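The best/worst-of-n preference-pair construction quoted in the Open Datasets row (generate 5 answers per prompt, score them with a reward model, keep the highest- and lowest-scoring answers as the chosen/rejected pair) can be sketched as follows. The scoring function here is a toy stand-in: the paper scores Llama3-8B-Instruct generations with ArmoRM, which is not reproduced here.

```python
# Sketch of best/worst-of-n preference-pair construction: sample n candidate
# answers per prompt, score each with a reward model, and keep the best and
# worst as the (chosen, rejected) pair for DPO-style training.

def build_preference_pair(prompt, answers, score_fn):
    """Return (chosen, rejected) from candidate answers ranked by score_fn."""
    scored = sorted(answers, key=lambda a: score_fn(prompt, a))
    return scored[-1], scored[0]  # highest- and lowest-scoring answers

if __name__ == "__main__":
    # Toy scorer (illustration only): longer answers get higher reward.
    toy_score = lambda prompt, ans: len(ans)
    answers = ["ok", "a fuller answer", "mid reply"]
    chosen, rejected = build_preference_pair("Q?", answers, toy_score)
    print(chosen, "|", rejected)
```

In the paper's setup, `score_fn` would call the reward model once per (prompt, answer) pair, and the loop would run over every prompt in UltraFeedback with n = 5 generations each.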
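The greedy, one-parameter-at-a-time search described in the Experiment Setup row (sweep only the parameter introduced by each newly added design element, keeping everything else fixed) can be sketched as below. The function names and the evaluation callback are illustrative assumptions, not the paper's code.

```python
# Greedy coordinate-wise hyper-parameter search: as each design element is
# added, sweep only the parameter it introduces, freeze the best value, and
# move on -- avoiding a combinatorial grid over (f, alpha, beta, gamma, eta).

def greedy_search(components, evaluate, base_config):
    """components: list of (param_name, candidate_values); evaluate: config -> score."""
    config = dict(base_config)
    for name, candidates in components:
        best = max(candidates, key=lambda v: evaluate({**config, name: v}))
        config[name] = best  # freeze the best value before tuning the next component
    return config

if __name__ == "__main__":
    # Toy objective peaking at beta=0.1, gamma=0.5 (illustration only; a real
    # run would train a model and evaluate its AlpacaEval 2 win rate).
    score = lambda cfg: -abs(cfg["beta"] - 0.1) - abs(cfg.get("gamma", 0) - 0.5)
    best = greedy_search([("beta", [0.01, 0.1, 1.0]), ("gamma", [0.1, 0.5, 0.9])],
                         score, {"beta": 0.0})
    print(best)
```

The trade-off named in the paper is visible here: the search cost grows additively with the number of candidate values per component instead of multiplicatively across all components.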
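The fixed decoding settings quoted in the Experiment Setup row map directly onto a Hugging Face-style generation config. The dict below is a sketch under that assumption (model loading is omitted, and `do_sample=True` is inferred, since temperature and top-p only take effect when sampling).

```python
# Decoding settings quoted from the paper: max new tokens 4096,
# temperature 0.7, top-p 0.1. Usable as keyword arguments to
# transformers' model.generate(**generation_config) (an assumption
# about their harness, not confirmed by the paper).
generation_config = {
    "max_new_tokens": 4096,
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.1,
}
```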