Self-Exploring Language Models: Active Preference Elicitation for Online Alignment

Authors: Shenao Zhang, Donghan Yu, Hiteshi Sharma, Han Zhong, Zhihan Liu, Ziyi Yang, Shuohang Wang, Hany Hassan Awadalla, Zhaoran Wang

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results demonstrate that when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, SELM significantly boosts the performance on instruction-following benchmarks such as MT-Bench and AlpacaEval 2.0, as well as various standard academic benchmarks in different settings. [...] In experiments, we implement SELM using Zephyr-7B-SFT (Tunstall et al., 2023b) and Llama-3-8B-Instruct (Meta, 2024) as base models. [...] We first report the performance of SELM and the baselines on the instruction-following chat benchmarks AlpacaEval 2.0 (Dubois et al., 2024) and MT-Bench (Zheng et al., 2024) in Table 1. [...] We also conduct pairwise comparisons between SELM, iterative DPO, and the base models to validate the effectiveness of our method. The results for AlpacaEval 2.0 are shown in Figure 2. [...] We also conduct ablation studies on the implicit reward captured by the SELM and DPO models. [...] Our experiments, conducted with Zephyr-7B-SFT and Llama-3-8B-Instruct models, demonstrate the efficacy of SELM with consistent improvements on AlpacaEval 2.0, MT-Bench, and academic benchmarks.
Researcher Affiliation | Collaboration | Shenao Zhang¹, Donghan Yu², Hiteshi Sharma², Han Zhong³, Zhihan Liu¹, Ziyi Yang², Shuohang Wang², Hany Hassan², Zhaoran Wang¹ (¹Northwestern University, ²Microsoft, ³Peking University)
Pseudocode | Yes | The complete pseudocode of our algorithm, named Self-Exploring Language Models (SELM), is outlined in Algorithm 1. [...] Algorithm 1 Self-Exploring Language Models (SELM) [...] Algorithm 2 Self-Exploring Language Models (SELM; Theoretical Version)
Open Source Code | No | Due to the absence of performant open-source online direct alignment codebases at the time of this study, we first implement an iterative version of DPO as the baseline, adhering to the same steps as Algorithm 1 but training the LLM with the standard DPO objective. [...] In experiments, we use the Alignment Handbook (Tunstall et al., 2023a) framework as our codebase.
Open Datasets | Yes | We adopt UltraFeedback (Cui et al., 2023) as our training dataset, which contains 61k preference pairs of single-turn conversations. For the external ranker during online alignment, we choose the small-sized PairRM (0.4B) (Jiang et al., 2023). [...] We first report the performance of SELM and the baselines on the instruction-following chat benchmarks AlpacaEval 2.0 (Dubois et al., 2024) and MT-Bench (Zheng et al., 2024) in Table 1. [...] Beyond instruction-following benchmarks, we also evaluate SELM and the baselines on several academic benchmarks, including GSM8K (Cobbe et al., 2021), HellaSwag (Zellers et al., 2019), ARC-Challenge (Clark et al., 2018), TruthfulQA (Lin et al., 2021), EQ-Bench (Paech, 2023), and OpenBookQA (OBQA) (Mihaylov et al., 2018).
Dataset Splits | Yes | In practice, we split the offline preference dataset into three portions with equal sizes, one for each iteration. [...] Algorithm 1 Self-Exploring Language Models (SELM) [...] 2: Set Dt as the t-th portion of D and generate y ∼ π_ref(· | x) for each prompt x in Dt.
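The per-iteration split quoted above can be sketched as follows. This is an illustrative Python snippet, not the authors' code; the function name `split_for_iterations` is hypothetical.

```python
# Sketch of the data split described in the quote: the offline preference
# dataset is partitioned into equal-sized contiguous portions, one per
# iteration, and iteration t trains only on portion t.
def split_for_iterations(dataset, num_iterations=3):
    """Partition `dataset` into `num_iterations` contiguous portions."""
    portion = len(dataset) // num_iterations
    splits = []
    for t in range(num_iterations):
        start = t * portion
        # The last portion absorbs any remainder so no example is dropped.
        end = (t + 1) * portion if t < num_iterations - 1 else len(dataset)
        splits.append(dataset[start:end])
    return splits

prompts = list(range(10))  # stand-in for the 61k UltraFeedback preference pairs
d1, d2, d3 = split_for_iterations(prompts)
```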
Hardware Specification | Yes | All experiments are conducted on 8x A100 GPUs.
Software Dependencies | No | In experiments, we use the Alignment Handbook (Tunstall et al., 2023a) framework as our codebase. [...] In addition, we apply iterative DPO and SELM on instruction fine-tuned models. Specifically, we consider two series of LLMs: Zephyr (Tunstall et al., 2023b) and Llama-3 (Meta, 2024), to demonstrate the robustness of SELM.
Experiment Setup | Yes | We conduct a grid search over hyperparameters, such as the batch size, learning rate, and iteration number, to identify the optimal settings for the iterative DPO baseline. We follow these best settings to train SELM. [...] we set the iteration number to 3 for Llama3-It-based and Zephyr-based models (excluding the first iteration of DPO training). Besides, we observe that choosing different batch sizes has a large effect on the model's performance and the optimal batch size heavily depends on the model architecture. In experiments, we set the batch size to 256 and 128 for the Zephyr-based and Llama3-It-based models, respectively. For the learning rate, we consider three design choices: cyclic learning rate with constant cycle amplitude, linearly decayed cycle amplitude, and decayed cycle amplitude at the last iteration. We find that a decaying cycle amplitude performs better than constant amplitudes in general. Thus, for Zephyr-based models, we set the learning rate to 5e-7 for the first three iterations and 1e-7 for the last iteration. In each iteration, the warmup ratio is 0.1. For Llama3-It-based models, we use a linearly decayed learning rate from 5e-7 to 1e-7 within 3 iterations with the same warmup ratio. [...] The optimism coefficient α is searched over 0.005, 0.001, 0.0005, and 0.0001 and is selected based on the average external reward on the holdout test set of UltraFeedback. We set α = 0.001 for Zephyr-based SELM and α = 0.0001 for Llama3-It-based SELM.
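The learning-rate schedule quoted above (a per-iteration peak decaying linearly from 5e-7 to 1e-7 across 3 iterations, with a 0.1 warmup ratio inside each iteration) can be sketched as below. This is an assumption-laden illustration, not the authors' implementation; the hold-at-peak behavior after warmup is a simplification, and the function names are hypothetical.

```python
def peak_lr(iteration, num_iterations=3, lr_start=5e-7, lr_end=1e-7):
    """Linearly interpolate the peak LR for iteration t in [0, num_iterations - 1],
    matching the 'linearly decayed cycle amplitude' choice for Llama3-It models."""
    if num_iterations == 1:
        return lr_start
    frac = iteration / (num_iterations - 1)
    return lr_start + frac * (lr_end - lr_start)

def lr_at_step(step, steps_per_iter, iteration, warmup_ratio=0.1):
    """Linear warmup over the first 10% of an iteration's steps, then hold at
    that iteration's peak (holding is a simplifying assumption, not from the paper)."""
    peak = peak_lr(iteration)
    warmup_steps = max(1, int(warmup_ratio * steps_per_iter))
    if step < warmup_steps:
        return peak * (step + 1) / warmup_steps
    return peak
```

Under this sketch, iteration 0 peaks at 5e-7, iteration 1 at 3e-7, and iteration 2 at 1e-7, reproducing the decaying cycle amplitude the quote describes.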