RLHF Workflow: From Reward Modeling to Online RLHF
Authors: Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, Tong Zhang
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our trained LLM achieves impressive performance on LLM chatbot benchmarks, including AlpacaEval-2, Arena-Hard, and MT-Bench, as well as other academic benchmarks such as HumanEval and TruthfulQA. We have shown that supervised fine-tuning (SFT) and iterative RLHF can obtain state-of-the-art performance with fully open-source datasets. Further, we have made our models, curated datasets, and comprehensive step-by-step code guidebooks publicly available. |
| Researcher Affiliation | Collaboration | ¹Salesforce AI Research, ²University of Illinois Urbana-Champaign. Email: EMAIL, EMAIL. |
| Pseudocode | Yes | Algorithm 1: Theoretical Online Iterative RLHF with Enhancer; Algorithm 2: Practical Version of Online Iterative RLHF with BT Reward Model |
| Open Source Code | Yes | Further, we have made our models, curated datasets, and comprehensive step-by-step code guidebooks publicly available. We also integrate the option of adding such a bias term in the revision of our public code. |
| Open Datasets | Yes | We collect a set of high-quality instruction datasets for SFT, such as ShareGPT (Chiang et al., 2023), SlimOrca (Lian et al., 2023b), MathInstruct (Yue et al., 2023), and Evol-Instruct (Xu et al., 2023a) (see the Appendix for a full list). We summarize the statistics of the open-source datasets that are used for the training in Table 5 and prepare them, as well as our data filtering script, on Hugging Face. |
| Dataset Splits | Yes | We evaluate the models by standard benchmarks, including AlpacaEval-2, MT-Bench, and Chat-Arena-Hard. Details are provided in the Appendix. We also measure the ability of the resulting models using academic benchmarks, including GSM-8K (Cobbe et al., 2021), MMLU (Hendrycks et al., 2020), HumanEval (Chen et al., 2021), TruthfulQA (Lin et al., 2021), ARC (Clark et al., 2018), and MBPP (Austin et al., 2021). |
| Hardware Specification | No | The paper mentions using vLLM for inference to accelerate data generation, but does not specify particular hardware such as GPU or CPU models. For example: "To accelerate data generation, we use vLLM (Kwon et al., 2023) for inference." |
| Software Dependencies | No | The paper mentions using the TRL package but does not specify its version number. Other software dependencies are not listed with version numbers. For example: "We use the DPO to approximate the computational oracle and implement DPO with the open-source package TRL." |
| Experiment Setup | Yes | The reward model is trained... for one epoch with a global batch size of 512. The learning rate is set to lr = 2 × 10⁻⁶, and a cosine learning rate schedule with a warm-up ratio of 0.03 is employed. ... We train the LLaMA-3-8B-based preference model for one epoch. The samples are packed into blocks with length 3072 and a global batch size of 128 is used. The learning rate is set to lr = 5 × 10⁻⁶, and a cosine learning rate schedule with a warm-up ratio of 0.03 is employed. ... The training is carried out for one epoch with a learning rate of 2 × 10⁻⁵. A cosine scheduler is employed, and the global batch size is set to 32 with a warm-up ratio of 0.03. To accelerate training, we follow Diao et al. (2023); Tunstall et al. (2023) to pack the samples and use a block size of 8192. ... We run DPO with the reference model π0 (the SFT model) on the historical data for 2 epochs... We use a cosine learning rate scheduler with a peak learning rate of 5e-7 and 0.03 warm-up ratio. We use a global batch size of 128 and use a KL coefficient of η = 0.1. Table 7 (training parameters): batch size per device 2, gradient accumulation steps 8, optimizer adamw_torch, LR scheduler type cosine, training epochs 2, beta 0.1. |
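The DPO hyperparameters quoted above can be collected into a single config sketch. This is a minimal illustration, not the authors' actual training script: the field names loosely mirror TRL/Transformers-style argument names, and the device count used below is an assumption chosen so the per-device settings reproduce the reported global batch size of 128.

```python
from dataclasses import dataclass


@dataclass
class DPOIterationConfig:
    """DPO settings as reported in the paper's experiment setup (Table 7 and text).

    Field names are illustrative, loosely following TRL-style arguments;
    they are not taken from the paper's released code.
    """
    learning_rate: float = 5e-7          # peak LR for the DPO runs
    lr_scheduler_type: str = "cosine"
    warmup_ratio: float = 0.03
    num_train_epochs: int = 2
    per_device_train_batch_size: int = 2
    gradient_accumulation_steps: int = 8
    optim: str = "adamw_torch"
    beta: float = 0.1                    # DPO beta, i.e. the KL coefficient eta

    def global_batch_size(self, num_devices: int) -> int:
        # global batch = per-device batch * gradient accumulation * devices
        return (self.per_device_train_batch_size
                * self.gradient_accumulation_steps
                * num_devices)


cfg = DPOIterationConfig()
# With 8 devices (an assumption), 2 * 8 * 8 = 128 matches the reported global batch size.
print(cfg.global_batch_size(num_devices=8))
```

The reported numbers are internally consistent only under some device count; the sketch makes that dependency explicit, which is the kind of detail a reproduction attempt would need the paper (or its code release) to pin down.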