SeRA: Self-Reviewing and Alignment of LLMs using Implicit Reward Margins

Authors: Jongwoo Ko, Saket Dingliwal, Bhavana Ganesh, Sailik Sengupta, Sravan Babu Bodapati, Aram Galstyan

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experimentation, including some on instruction-following tasks, demonstrates the effectiveness and generality of SeRA in training LLMs on offline preference datasets with DAAs. We empirically showed that SeRA can be widely used across various DAAs (e.g., DPO, IPO, SLiC-HF, SimPO) and on various LLMs (e.g., TinyLlama-1.1B, Pythia-2.8B, Mistral-7B), consistently outperforming SoTA baselines (Kim et al., 2024a; Pattnaik et al., 2024).
Researcher Affiliation Collaboration Jongwoo Ko1 Saket Dingliwal2 Bhavana Ganesh2 Sailik Sengupta3 Sravan Bodapati2 Aram Galstyan2 1KAIST AI 2Amazon AGI 3AWS AI Labs
Pseudocode Yes Algorithm 1: Self-Reviewing and Alignment
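The central quantity in the paper's self-reviewing step is the DPO-style implicit reward margin, which ranks preference pairs for filtering. A minimal sketch of margin-based pair selection, assuming per-sequence log-probabilities under the policy and the reference model are precomputed (function names and the dict layout below are illustrative assumptions, not the authors' code):

```python
def implicit_reward_margin(logp_w, logp_ref_w, logp_l, logp_ref_l, beta=0.2):
    """DPO-style implicit reward margin:
    beta * [(log pi(y_w|x) - log pi_ref(y_w|x))
            - (log pi(y_l|x) - log pi_ref(y_l|x))]."""
    return beta * ((logp_w - logp_ref_w) - (logp_l - logp_ref_l))


def select_top_k(pairs, k_proportion=0.7, beta=0.2):
    """Keep the top-k proportion of preference pairs, ranked by margin (descending).

    Each pair is a dict with a "logps" tuple:
    (logp_policy_chosen, logp_ref_chosen, logp_policy_rejected, logp_ref_rejected).
    """
    ranked = sorted(
        pairs,
        key=lambda p: implicit_reward_margin(*p["logps"], beta=beta),
        reverse=True,
    )
    k = max(1, int(k_proportion * len(ranked)))
    return ranked[:k]
```

With k_proportion = 0.7 this mirrors the selection proportion K = 0.7 reported in the experiment setup; the margin computation itself is standard for DAAs, while the exact filtering schedule follows Algorithm 1 in the paper.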
Open Source Code No The paper does not provide an explicit statement about releasing code for the methodology, nor does it include a direct link to a code repository.
Open Datasets Yes UltraChat-200K (instruction-following; Tunstall et al. 2023): This is a heavily filtered version of UltraChat (Ding et al., 2023), originally used to train Zephyr-7B-β (Tunstall et al., 2023). It is obtained from the original version, which consists of 1.4M dialogues generated by ChatGPT spanning a wide range of topics, by removing dialogues that contain grammatical errors or where the assistant replies with phrases like "I do not have emotions" or "I don't have opinions". UltraFeedback (preference dataset; Cui et al. 2023; Tunstall et al. 2023): This is a large-scale, fine-grained, and diverse preference dataset used for training powerful reward models and critic models. Cui et al. (2023) collected about 64k prompts from diverse resources, including UltraChat, ShareGPT, and Evol-Instruct (Xu et al., 2023). HH-RLHF (preference dataset; Bai et al. 2022a): This dataset concerns human preferences regarding helpfulness and harmlessness (Bai et al., 2022a), and it was originally used to train preference (or reward) models for subsequent RLHF training. TL;DR (preference dataset; Stiennon et al. 2020): This is a dataset of human feedback that was released for reward modeling. AlpacaEval (instruction-following; Dubois et al. 2024b): This dataset is a slight modification (or simplification) of the AlpacaFarm evaluation set. Vicuna Evaluation (instruction-following; Chiang et al. 2023): We also use 80 challenging questions that were used for evaluating Vicuna, following Pattnaik et al. (2024). Evol-Instruct Evaluation (instruction-following; Xu et al. 2023): Similar to Vicuna, Evol-Instruct (Xu et al., 2023) contains 218 questions spanning multiple topics, generated using the Evol-Instruct procedure.
Dataset Splits Yes We use the binarized version of UltraFeedback (which contains two response pairs with corresponding ratings for a given input query) (Tunstall et al., 2023) for all DAAs except Curri-DPO (Pattnaik et al., 2024), for which we use the original UltraFeedback (which contains four response pairs) (Cui et al., 2023). The binarized version provides train and test splits of the prompt and response pairs; accordingly, we train the models on the train split and evaluate the trained models on the test split.
Hardware Specification Yes For other models, we use the maximum batch size that fits on A100 40GB GPUs, while matching the effective batch size with Mistral-7B by considering the batch size and gradient accumulation. The results were obtained using a machine with 4 A100 (40GB) GPUs on TinyLlama-1.1B with k = 0.7N and k = 0.3N.
Software Dependencies No The paper mentions using "vLLM (Kwon et al., 2023)" as an LLM inference framework, but it does not specify version numbers for this or any other software components (e.g., programming language, deep learning libraries) used for implementation.
Experiment Setup Yes Our hyperparameters are shown in Tab. 4. For Mistral-7B, we follow the experimental setup described in the official repository of Tunstall et al. (2023), except for the LoRA rank (Hu et al., 2022), which we change to 8. For other models, we use the maximum batch size that fits on A100 40GB GPUs, while matching the effective batch size with Mistral-7B by considering the batch size and gradient accumulation.
Table 4: Hyperparameter values used in SeRA experiments in section 4 and section 5.
Hyperparameter | TinyLLaMA-1.1B | Pythia-2.8B | Mistral-7B
Fine-tuning method | Full fine-tuning | Full fine-tuning | LoRA (r = 8)
Learning rate | 3.0e-6 | 3.0e-6 | 5.0e-6
DAA parameter (β) | 0.2 (DPO) / 0.2 (SLiC-HF) / 1.0 (IPO) | 0.2 (DPO) / 0.2 (SLiC-HF) / 1.0 (IPO) | 0.01 (DPO)
Batch size | 8 | 4 | 4
Gradient accumulation | 2 | 4 | 4
# Iterations | 3 (1 epoch per iteration) | 3 (1 epoch per iteration) | 3 (1 epoch per iteration)
Selection proportion (K) | 0.7 | 0.7 | 0.7
For all experiments, we set T = 3 and γ = 0.3. For the DAA parameter β, we search for the optimal value among {0.05, 0.2, 1.0} for TinyLLaMA-1.1B and reuse it for Pythia-2.8B in all experimental setups. To generate diverse candidate responses for preference bootstrapping, we sample responses with a temperature of 0.7 and top-p of 0.95 (nucleus sampling) during the training procedure. For all experiments, we generate 4 responses for every single prompt.
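The generation settings quoted above (temperature 0.7, nucleus top-p 0.95, 4 candidate responses per prompt) can be made concrete with a toy, self-contained nucleus-sampling sketch. The paper itself generates with vLLM; the pure-Python implementation below is only illustrative, and the function names are assumptions:

```python
import math
import random

def nucleus_sample(logits, temperature=0.7, top_p=0.95, rng=None):
    """Temperature-scaled nucleus (top-p) sampling over a small vocabulary.

    Returns the index of the sampled token."""
    rng = rng or random.Random(0)
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    probs = [math.exp(s - m) for s in scaled]
    z = sum(probs)
    probs = [p / z for p in probs]
    # Keep the smallest set of tokens whose cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Renormalize over the nucleus and sample from it.
    z = sum(probs[i] for i in kept)
    r = rng.random() * z
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]

def sample_candidates(logits, n=4, **kw):
    """Draw n candidate tokens per prompt, mirroring the 4 responses per prompt."""
    rng = random.Random(42)
    return [nucleus_sample(logits, rng=rng, **kw) for _ in range(n)]
```

A temperature below 1.0 sharpens the distribution while top-p truncation removes the low-probability tail, which is why the combination (0.7, 0.95) yields diverse but still fluent candidates for preference bootstrapping.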