MACPO: Weak-to-Strong Alignment via Multi-Agent Contrastive Preference Optimization
Authors: Yougang Lyu, Lingyong Yan, Zihan Wang, Dawei Yin, Pengjie Ren, Maarten de Rijke, Zhaochun Ren
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on the HH-RLHF and PKU-Safe RLHF datasets, evaluated using both automatic metrics and human judgments, demonstrate that MACPO simultaneously improves the alignment performance of strong students and weak teachers. |
| Researcher Affiliation | Collaboration | 1University of Amsterdam 2Baidu Inc. 3Shandong University 4Leiden University |
| Pseudocode | Yes | Algorithm 1 Multi-Agent Contrastive Preference Optimization (MACPO) |
| Open Source Code | Yes | Our code and dataset are available at https://github.com/youganglyu/MACPO. |
| Open Datasets | Yes | HH-RLHF (Bai et al., 2022a) consists of conversations between humans and LLM assistants. Each sample contains a pair of conversations, with human annotators marking one conversation as preferred. The dataset includes a helpful subset (denoted as HH-Helpful) and a harmless subset (denoted as HH-Harmless). We randomly filter samples from each subset to conduct experiments on weak-to-strong alignment. PKU-Safe RLHF (Dai et al., 2024) consists of conversation comparisons. Each comparison is annotated with two labels: a preference label indicating the human's choice between two responses, and a harmless label associated with the preferred response, confirming whether it complies with safety standards. |
| Dataset Splits | Yes | HH-RLHF (Bai et al., 2022a): The dataset includes a helpfulness subset and a harmlessness subset. For each subset, we filter 10,000 samples for training and 2,000 samples for testing. Furthermore, we split the training set into two halves for weak teacher initialization and weak-to-strong alignment experiments, respectively. PKU-Safe RLHF (Dai et al., 2024): We filter 10,000 samples for training and 1,000 samples for testing. Specifically, we split the training set into two halves for weak teacher initialization and weak-to-strong alignment experiments, respectively. |
| Hardware Specification | Yes | All experiments are conducted on 8 80G A100 GPUs. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer, PEFT, LLaMA-Factory, and LoRA, and cites their respective papers, but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | During the training phase, weak teachers and strong students are initialized with SFT for 3 epochs, and then these models are trained with DPO for 1 epoch at each iteration. Moreover, we use the AdamW optimizer (Loshchilov & Hutter, 2019) with initial learning rates of 5 × 10⁻⁵ for SFT and 1 × 10⁻⁵ for DPO. The batch sizes are 32 for SFT and 16 for DPO. The scalar weighting hyperparameter γ is set to 0.2. |
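The dataset-split protocol quoted above (filter a fixed number of samples per subset, hold out a test set, then halve the training set between weak-teacher initialization and weak-to-strong alignment) can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the function name and return keys are invented for clarity.

```python
import random

def split_for_weak_to_strong(samples, n_train, n_test, seed=0):
    """Illustrative sketch of the paper's split protocol:
    filter n_train + n_test samples, hold out n_test for testing,
    then halve the training set between weak-teacher initialization
    and weak-to-strong alignment."""
    rng = random.Random(seed)
    pool = rng.sample(samples, n_train + n_test)
    train, test = pool[:n_train], pool[n_train:]
    half = n_train // 2
    return {
        "teacher_init": train[:half],     # first half: weak teacher SFT
        "weak_to_strong": train[half:],   # second half: alignment iterations
        "test": test,
    }

# HH-RLHF subsets: 10,000 training / 2,000 test samples each
hh_splits = split_for_weak_to_strong(list(range(50_000)), 10_000, 2_000)

# PKU-Safe RLHF: 10,000 training / 1,000 test samples
pku_splits = split_for_weak_to_strong(list(range(50_000)), 10_000, 1_000)
```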
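The reported training hyperparameters can be collected into a single configuration sketch. The values come from the paper's experiment setup; the dictionary keys are illustrative and do not correspond to LLaMA-Factory's actual config fields.

```python
# Hyperparameters reported in the MACPO paper (keys are illustrative,
# not LLaMA-Factory's real field names).
SFT_CONFIG = {
    "epochs": 3,              # weak teachers and strong students initialized with SFT
    "learning_rate": 5e-5,
    "batch_size": 32,
    "optimizer": "AdamW",
}

DPO_CONFIG = {
    "epochs_per_iteration": 1,  # DPO for 1 epoch at each MACPO iteration
    "learning_rate": 1e-5,
    "batch_size": 16,
    "optimizer": "AdamW",
    "gamma": 0.2,               # scalar weighting hyperparameter γ
}
```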