Alleviating Shifted Distribution in Human Preference Alignment through Meta-Learning

Authors: Shihan Dou, Yan Liu, Enyu Zhou, Songyang Gao, Tianlong Li, Limao Xiong, Xin Zhao, Haoxiang Jia, Junjie Ye, Rui Zheng, Tao Gui, Qi Zhang, Xuanjing Huang

AAAI 2025

Reproducibility Checklist (Variable · Result · LLM Response)
Research Type: Experimental — "Extensive experiments demonstrate that Meta RM can iteratively enhance the performance of human preference alignment by improving the RM's capacity to identify subtle differences in samples of shifted distributions. To evaluate the effectiveness of Meta RM, we conduct extensive experiments on Anthropic's HH-RLHF (Bai et al. 2022) and OpenAI's summarization (Stiennon et al. 2020b) datasets."
Researcher Affiliation: Collaboration — 1 School of Computer Science, Fudan University, Shanghai, China; 2 Ant Group, Shanghai, China; 3 School of Computer Science, Peking University, Beijing, China
Pseudocode: Yes — Algorithm 1: The optimization process of Meta RM.
Require: θ, D, S, n, m
Require: η, α
1: for t = 0, ..., T−1 do
2:   Sample a mini-batch X_t = {(x_i, y_i^w, y_i^l), 1 ≤ i ≤ n} of size n from the preference-pair dataset D
3:   Sample a mini-batch X_s = {(x_i, s_i), 1 ≤ i ≤ m} of size m from the meta dataset S
4:   Compute the difference loss J_θ(X_s) with the parameters θ_t on X_s
5:   (Meta-process) Compute adapted parameters θ′_t with gradient ascent: θ′_t ← θ_t + η ∇_θ J_θ(X_s)
6:   Compute the vanilla loss L_θ′(X_t) with the parameters θ′_t on X_t
7:   (Meta RM-optimization) Update the parameters θ_t with gradient descent: θ_{t+1} ← θ_t − α ∇_θ L_θ′(X_t)
8: end for
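Read as a first-order meta-learning update, the two steps of Algorithm 1 can be sketched in plain numpy. The toy gradient functions below are hypothetical stand-ins for the paper's difference loss J and vanilla ranking loss L, and the sketch keeps only the first-order approximation of ∇_θ L_θ′ (any second-order term is dropped):

```python
import numpy as np

def meta_rm_step(theta, grad_J, grad_L, eta=5e-6, alpha=5e-6):
    """One round of Algorithm 1 (first-order approximation).

    grad_J: gradient of the difference loss J_theta on the meta batch X_s
    grad_L: gradient of the vanilla loss L on the preference batch X_t
    """
    # (Meta-process) gradient ascent on the difference loss
    theta_adapted = theta + eta * grad_J(theta)
    # (Meta RM-optimization) gradient descent on the vanilla loss,
    # with the gradient evaluated at the adapted parameters
    return theta - alpha * grad_L(theta_adapted)

# Toy quadratic losses, purely for illustration
theta = np.zeros(3)
grad_J = lambda th: -(th - 1.0)   # ascent drives theta toward 1
grad_L = lambda th: (th - 2.0)    # descent drives theta toward 2
theta = meta_rm_step(theta, grad_J, grad_L, eta=0.1, alpha=0.1)
```

The exact forms of J and L, and whether the gradient through the adaptation step is kept, follow the paper; this sketch only mirrors the update order.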
Open Source Code: No — The paper does not explicitly state that source code for the methodology is released, nor does it provide a link to a repository.
Open Datasets: Yes — "To evaluate the effectiveness of Meta RM, we conduct extensive experiments on Anthropic's HH-RLHF (Bai et al. 2022) and OpenAI's summarization (Stiennon et al. 2020b) datasets. ... For human preference data, we use the Oasst1 dataset (Köpf et al. 2024) as the helpfulness data of the OOD task. ... we use PKU-SafeRLHF (Dai et al. 2024) as the harmlessness data"
Dataset Splits: Yes — "For human preference data, we utilize Anthropic's HH-RLHF (Bai et al. 2022), a comprehensive collection of human preferences concerning AI assistant responses. It contains 161k training samples and 8,500 testing samples, including helpfulness and harmlessness data."
Hardware Specification: Yes — "The fine-tuning process was conducted on a single node with eight Nvidia A100-80G GPUs, and the global batch size is set to 32."
Software Dependencies: No — The paper mentions "Llama-2 (Touvron et al. 2023) with seven billion parameters as the base model" but does not specify any programming languages, libraries, or other software with version numbers necessary for replication.
Experiment Setup: Yes — "In the SFT phase, the learning rate is set to 2e-5, and we train two epochs with a linear decay to zero. We employ a warmup period of 0.3 epochs. ... In the reward modelling phase, the learning rate is set to 5e-6, and the global batch size is set to 16 for both the vanilla training phase and the meta-process phase. The training epoch on original preference-pair datasets is only one for our proposed method and all baselines. For each optimization round of Meta RM, the learning rates α and η are both set to 5e-6. ... In the PPO phase, the learning rates for the policy model and critic model are 5e-7 and 1.5e-6. For each query, we collect 16 roll-out samples using nucleus sampling. The temperature, top-p, and repetition penalty in the sampling phase are set to 0.8, 0.9, and 1.1, respectively. We set the token-level KL penalty coefficient β to 0.05 with a clip value of 0.8."
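The quoted token-level KL penalty (β = 0.05, clip value 0.8) can be sketched as reward shaping over one sampled response. This is a common PPO-for-RLHF construction, not the paper's verified implementation; the clipping convention and the placement of the reward-model score at the final token are assumptions:

```python
import numpy as np

def shaped_rewards(logp_policy, logp_ref, rm_score, beta=0.05, clip=0.8):
    """Per-token KL penalty with clipping; RM score added at the last token.

    logp_policy / logp_ref: per-token log-probs of the sampled response under
    the policy and the reference (SFT) model.
    """
    kl = np.asarray(logp_policy) - np.asarray(logp_ref)
    rewards = -beta * np.clip(kl, -clip, clip)  # token-level KL penalty
    rewards[-1] += rm_score                     # reward model score at the end
    return rewards

# Example: identical policy and reference log-probs, so the penalty is zero
# and only the RM score at the final token remains
r = shaped_rewards([-1.0, -2.0, -0.5], [-1.0, -2.0, -0.5], rm_score=1.5)
```

With the example inputs above, `r` is `[0.0, 0.0, 1.5]`: the KL term vanishes when the policy has not drifted from the reference, and the penalty grows (up to the clip) as it does.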