RLHF Workflow: From Reward Modeling to Online RLHF
Authors: Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, Tong Zhang
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our trained LLM achieves impressive performance on LLM chatbot benchmarks, including AlpacaEval-2, Arena-Hard, and MT-Bench, as well as other academic benchmarks such as HumanEval and TruthfulQA. We have shown that supervised fine-tuning (SFT) and iterative RLHF can obtain state-of-the-art performance with fully open-source datasets. Further, we have made our models, curated datasets, and comprehensive step-by-step code guidebooks publicly available. |
| Researcher Affiliation | Collaboration | ¹Salesforce AI Research, ²University of Illinois Urbana-Champaign. Email: EMAIL, EMAIL. |
| Pseudocode | Yes | Algorithm 1: Theoretical Online Iterative RLHF with Enhancer; Algorithm 2: Practical Version of Online Iterative RLHF with BT Reward Model |
| Open Source Code | Yes | Further, we have made our models, curated datasets, and comprehensive step-by-step code guidebooks publicly available. We also integrate the option of adding such a bias term in the revision of our public code. |
| Open Datasets | Yes | We collect a set of high-quality instruction datasets for SFT, such as ShareGPT (Chiang et al., 2023), SlimOrca (Lian et al., 2023b), MathInstruct (Yue et al., 2023), and Evol-Instruct (Xu et al., 2023a) (see the Appendix for a full list). We summarize the statistics of the open-source datasets that are used for the training in Table 5 and prepare them, as well as our data filtering script, on Hugging Face. |
| Dataset Splits | Yes | We evaluate the models by standard benchmarks, including AlpacaEval-2, MT-Bench, and Chat-Arena-Hard. Details are provided in the Appendix. We also measure the ability of the resulting models using academic benchmarks, including GSM-8K (Cobbe et al., 2021), MMLU (Hendrycks et al., 2020), HumanEval (Chen et al., 2021), TruthfulQA (Lin et al., 2021), ARC (Clark et al., 2018), and MBPP (Austin et al., 2021). |
| Hardware Specification | No | The paper mentions using vLLM for inference to accelerate data generation, but does not specify particular hardware such as GPU or CPU models. For example: "To accelerate data generation, we use vLLM (Kwon et al., 2023) for inference." |
| Software Dependencies | No | The paper mentions using the TRL package but does not specify its version number. Other software dependencies are not listed with version numbers. For example: "We use the DPO to approximate the computational oracle and implement DPO with the open-source package TRL." |
| Experiment Setup | Yes | The reward model is trained... for one epoch with a global batch size of 512. The learning rate is set to lr = 2 × 10⁻⁶, and a cosine learning rate schedule with a warm-up ratio of 0.03 is employed. ... We train the LLaMA-3-8B-based preference model for one epoch. The samples are packed into blocks with length 3072 and a global batch size of 128 is used. The learning rate is set to lr = 5 × 10⁻⁶, and a cosine learning rate schedule with a warm-up ratio of 0.03 is employed. ... The training is carried out for one epoch with a learning rate of 2 × 10⁻⁵. A cosine scheduler is employed, and the global batch size is set to 32 with a warm-up ratio of 0.03. To accelerate training, we follow Diao et al. (2023); Tunstall et al. (2023) to pack the samples and use a block size of 8192. ... We run DPO with the reference model π0 (the SFT model) on the historical data for 2 epochs... We use a cosine learning rate scheduler with a peak learning rate of 5e-7 and 0.03 warm-up ratio. We use a global batch size of 128 and use a KL coefficient of η = 0.1. Table 7 (training parameters): batch size per device 2, gradient accumulation steps 8, optimizer adamw_torch, LR scheduler type cosine, training epochs 2, beta 0.1. |
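The DPO hyperparameters quoted above can be collected into a single config sketch. This is a minimal illustration, not the authors' actual training script: the field names loosely mirror TRL/Transformers-style argument names, and the device count used below is an assumption chosen so the per-device settings reproduce the reported global batch size of 128.

```python
from dataclasses import dataclass


@dataclass
class DPOIterationConfig:
    """DPO settings as reported in the paper's experiment setup (Table 7 and text).

    Field names are illustrative, loosely following TRL-style arguments;
    they are not taken from the paper's released code.
    """
    learning_rate: float = 5e-7          # peak LR for the DPO runs
    lr_scheduler_type: str = "cosine"
    warmup_ratio: float = 0.03
    num_train_epochs: int = 2
    per_device_train_batch_size: int = 2
    gradient_accumulation_steps: int = 8
    optim: str = "adamw_torch"
    beta: float = 0.1                    # DPO beta, i.e. the KL coefficient eta

    def global_batch_size(self, num_devices: int) -> int:
        # global batch = per-device batch * gradient accumulation * devices
        return (self.per_device_train_batch_size
                * self.gradient_accumulation_steps
                * num_devices)


cfg = DPOIterationConfig()
# With 8 devices (an assumption), 2 * 8 * 8 = 128 matches the reported global batch size.
print(cfg.global_batch_size(num_devices=8))
```

The reported numbers are internally consistent only under some device count; the sketch makes that dependency explicit, which is the kind of detail a reproduction attempt would need the paper (or its code release) to pin down.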