Alleviating Shifted Distribution in Human Preference Alignment through Meta-Learning

Authors: Shihan Dou, Yan Liu, Enyu Zhou, Songyang Gao, Tianlong Li, Limao Xiong, Xin Zhao, Haoxiang Jia, Junjie Ye, Rui Zheng, Tao Gui, Qi Zhang, Xuanjing Huang

AAAI 2025

Reproducibility Checklist (Variable · Result · LLM Response)
Research Type: Experimental — "Extensive experiments demonstrate that Meta RM can iteratively enhance the performance of human preference alignment by improving the RM's capacity to identify subtle differences in samples of shifted distributions. To evaluate the effectiveness of Meta RM, we conduct extensive experiments on Anthropic's HH-RLHF (Bai et al. 2022) and OpenAI's summarization (Stiennon et al. 2020b) datasets."
Researcher Affiliation: Collaboration — 1 School of Computer Science, Fudan University, Shanghai, China; 2 Ant Group, Shanghai, China; 3 School of Computer Science, Peking University, Beijing, China
Pseudocode: Yes — Algorithm 1: The optimization process of Meta RM.
Require: θ, D, S, n, m
Require: η, α
1: for t = 0, ..., T−1 do
2:   Sample a mini-batch X_t = {(x_i, y_i^w, y_i^l), 1 ≤ i ≤ n} of size n from the preference-pair dataset D
3:   Sample a mini-batch X_s = {(x_i, s_i), 1 ≤ i ≤ m} of size m from the meta dataset S
4:   Compute the difference loss J_θ(X_s) with the parameters θ_t on X_s
5:   (Meta-process) Compute adapted parameters θ′_t with gradient ascent: θ′_t ← θ_t + η ∇_θ J_θ(X_s)
6:   Compute the vanilla loss L_θ′(X_t) with the parameters θ′_t on X_t
7:   (Meta RM-optimization) Update the parameters θ_t with gradient descent: θ_{t+1} ← θ_t − α ∇_θ L_θ′(X_t)
8: end for
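Read as a first-order meta-learning update, the two steps of Algorithm 1 can be sketched in plain numpy. The toy gradient functions below are hypothetical stand-ins for the paper's difference loss J and vanilla ranking loss L, and the sketch keeps only the first-order approximation of ∇_θ L_θ′ (any second-order term is dropped):

```python
import numpy as np

def meta_rm_step(theta, grad_J, grad_L, eta=5e-6, alpha=5e-6):
    """One round of Algorithm 1 (first-order approximation).

    grad_J: gradient of the difference loss J_theta on the meta batch X_s
    grad_L: gradient of the vanilla loss L on the preference batch X_t
    """
    # (Meta-process) gradient ascent on the difference loss
    theta_adapted = theta + eta * grad_J(theta)
    # (Meta RM-optimization) gradient descent on the vanilla loss,
    # with the gradient evaluated at the adapted parameters
    return theta - alpha * grad_L(theta_adapted)

# Toy quadratic losses, purely for illustration
theta = np.zeros(3)
grad_J = lambda th: -(th - 1.0)   # ascent drives theta toward 1
grad_L = lambda th: (th - 2.0)    # descent drives theta toward 2
theta = meta_rm_step(theta, grad_J, grad_L, eta=0.1, alpha=0.1)
```

The exact forms of J and L, and whether the gradient through the adaptation step is kept, follow the paper; this sketch only mirrors the update order.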
Open Source Code: No — The paper does not explicitly state that source code for the methodology is released, nor does it provide a link to a repository.
Open Datasets: Yes — "To evaluate the effectiveness of Meta RM, we conduct extensive experiments on Anthropic's HH-RLHF (Bai et al. 2022) and OpenAI's summarization (Stiennon et al. 2020b) datasets. ... For human preference data, we use the Oasst1 dataset (Köpf et al. 2024) as the helpfulness data of the OOD task. ... we use PKU-SafeRLHF (Dai et al. 2024) as the harmlessness data"
Dataset Splits: Yes — "For human preference data, we utilize Anthropic's HH-RLHF (Bai et al. 2022), a comprehensive collection of human preferences concerning AI assistant responses. It contains 161k training samples and 8,500 testing samples, including helpfulness and harmlessness data."
Hardware Specification: Yes — "The fine-tuning process was conducted on a single node with eight Nvidia A100-80G GPUs, and the global batch size is set to 32."
Software Dependencies: No — The paper mentions "Llama-2 (Touvron et al. 2023) with seven billion parameters as the base model" but does not specify any programming languages, libraries, or other software with version numbers necessary for replication.
Experiment Setup: Yes — "In the SFT phase, the learning rate is set to 2e-5, and we train two epochs with a linear decay to zero. We employ a warmup period of 0.3 epochs. ... In the reward modelling phase, the learning rate is set to 5e-6, and the global batch size is set to 16 for both the vanilla training phase and the meta-process phase. The training epoch on original preference-pair datasets is only one for our proposed method and all baselines. For each optimization round of Meta RM, the learning rates α and η are both set to 5e-6. ... In the PPO phase, the learning rates for the policy model and critic model are 5e-7 and 1.5e-6. For each query, we collect 16 roll-out samples using nucleus sampling. The temperature, top-p, and repetition penalty in the sampling phase are set to 0.8, 0.9, and 1.1, respectively. We set the token-level KL penalty coefficient β to 0.05 with a clip value of 0.8."
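The quoted token-level KL penalty (β = 0.05, clip value 0.8) can be sketched as reward shaping over one sampled response. This is a common PPO-for-RLHF construction, not the paper's verified implementation; the clipping convention and the placement of the reward-model score at the final token are assumptions:

```python
import numpy as np

def shaped_rewards(logp_policy, logp_ref, rm_score, beta=0.05, clip=0.8):
    """Per-token KL penalty with clipping; RM score added at the last token.

    logp_policy / logp_ref: per-token log-probs of the sampled response under
    the policy and the reference (SFT) model.
    """
    kl = np.asarray(logp_policy) - np.asarray(logp_ref)
    rewards = -beta * np.clip(kl, -clip, clip)  # token-level KL penalty
    rewards[-1] += rm_score                     # reward model score at the end
    return rewards

# Example: identical policy and reference log-probs, so the penalty is zero
# and only the RM score at the final token remains
r = shaped_rewards([-1.0, -2.0, -0.5], [-1.0, -2.0, -0.5], rm_score=1.5)
```

With the example inputs above, `r` is `[0.0, 0.0, 1.5]`: the KL term vanishes when the policy has not drifted from the reference, and the penalty grows (up to the clip) as it does.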