MACPO: Weak-to-Strong Alignment via Multi-Agent Contrastive Preference Optimization

Authors: Yougang Lyu, Lingyong Yan, Zihan Wang, Dawei Yin, Pengjie Ren, Maarten de Rijke, Zhaochun Ren

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on the HH-RLHF and PKU-SafeRLHF datasets, evaluated using both automatic metrics and human judgments, demonstrate that MACPO simultaneously improves the alignment performance of strong students and weak teachers.
Researcher Affiliation | Collaboration | University of Amsterdam, Baidu Inc., Shandong University, Leiden University
Pseudocode | Yes | Algorithm 1: Multi-Agent Contrastive Preference Optimization (MACPO)
Open Source Code | Yes | Our code and dataset are available at https://github.com/youganglyu/MACPO.
Open Datasets | Yes | HH-RLHF (Bai et al., 2022a) consists of conversations between humans and LLM assistants. Each sample contains a pair of conversations, with human annotators marking one conversation as preferred. The dataset includes a helpful subset (denoted HH-Helpful) and a harmless subset (denoted HH-Harmless). We randomly filter samples from each subset to conduct the weak-to-strong alignment experiments. PKU-SafeRLHF (Dai et al., 2024) consists of conversation comparisons. Each comparison is annotated with two labels: a preference label indicating the human's choice between two responses, and a harmlessness label on the preferred response confirming whether it complies with safety standards.
Dataset Splits | Yes | HH-RLHF (Bai et al., 2022a): The dataset includes a helpfulness subset and a harmlessness subset. For each subset, we filter 10,000 samples for training and 2,000 samples for testing, then split the training set into two halves, used for weak teacher initialization and for the weak-to-strong alignment experiments, respectively. PKU-SafeRLHF (Dai et al., 2024): We filter 10,000 samples for training and 1,000 samples for testing, and likewise split the training set into two halves for weak teacher initialization and weak-to-strong alignment experiments.
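The described split (filter a fixed-size training pool, then halve it into a weak-teacher-initialization set and a weak-to-strong set) can be sketched as below. This is a minimal illustration, not the paper's actual preprocessing code; the sample structure and seed are assumptions.

```python
import random

def split_for_weak_to_strong(samples, seed=42):
    """Shuffle a filtered training pool and split it into two equal
    halves: one for weak-teacher initialization, one for the
    weak-to-strong alignment experiments."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

# e.g. a pool of 10,000 filtered HH-RLHF training samples
train_pool = [{"id": i} for i in range(10_000)]
teacher_init, w2s_align = split_for_weak_to_strong(train_pool)
print(len(teacher_init), len(w2s_align))  # 5000 5000
```

Shuffling before splitting avoids any ordering bias in the released data leaking into one half.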
Hardware Specification | Yes | All experiments are conducted on 8 A100 GPUs (80 GB each).
Software Dependencies | No | The paper mentions the AdamW optimizer, PEFT, LLaMA-Factory, and LoRA, and cites their respective papers, but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | During the training phase, weak teachers and strong students are initialized with SFT for 3 epochs, and these models are then trained with DPO for 1 epoch at each iteration. Moreover, we use the AdamW optimizer (Loshchilov & Hutter, 2019) with initial learning rates of 5 x 10^-5 for SFT and 1 x 10^-5 for DPO. The batch sizes are 32 for SFT and 16 for DPO. The scalar weighting hyperparameter γ is set to 0.2.
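Each iteration's DPO step optimizes the standard DPO objective over preference pairs; the sketch below shows that per-pair loss (Rafailov et al., 2023). It is a generic illustration, not MACPO's full multi-agent contrastive objective (which Algorithm 1 in the paper specifies, with γ = 0.2 weighting its contrastive terms); the β value here is an assumption.

```python
import math

def dpo_pair_loss(logp_w_policy, logp_l_policy,
                  logp_w_ref, logp_l_ref, beta=0.1):
    """DPO loss for one preference pair:
    -log sigmoid(beta * [(logpi(y_w) - logpref(y_w))
                         - (logpi(y_l) - logpref(y_l))]).
    Inputs are sequence log-probabilities of the chosen (w)
    and rejected (l) responses under the policy and the
    frozen reference model."""
    margin = beta * ((logp_w_policy - logp_w_ref)
                     - (logp_l_policy - logp_l_ref))
    # -log(sigmoid(margin)), written in a numerically direct form
    return math.log1p(math.exp(-margin))

# Policy prefers the chosen response more than the reference does,
# so the margin is positive and the loss drops below log(2).
loss = dpo_pair_loss(-1.0, -2.0, -1.2, -1.8)
print(loss)
```

The loss is minimized by widening the policy's chosen-vs-rejected log-probability gap relative to the reference model, which is what each DPO epoch in the setup above drives toward.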