MACPO: Weak-to-Strong Alignment via Multi-Agent Contrastive Preference Optimization

Authors: Yougang Lyu, Lingyong Yan, Zihan Wang, Dawei Yin, Pengjie Ren, Maarten de Rijke, Zhaochun Ren

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on the HH-RLHF and PKU-SafeRLHF datasets, evaluated using both automatic metrics and human judgments, demonstrate that MACPO simultaneously improves the alignment performance of strong students and weak teachers.
Researcher Affiliation | Collaboration | University of Amsterdam, Baidu Inc., Shandong University, Leiden University
Pseudocode | Yes | Algorithm 1: Multi-Agent Contrastive Preference Optimization (MACPO)
Open Source Code | Yes | Our code and dataset are available at https://github.com/youganglyu/MACPO.
Open Datasets | Yes | HH-RLHF (Bai et al., 2022a) consists of conversations between humans and LLM assistants. Each sample contains a pair of conversations, with human annotators marking one conversation as preferred. The dataset includes a helpful subset (denoted HH-Helpful) and a harmless subset (denoted HH-Harmless). We randomly filter samples from each subset to conduct the weak-to-strong alignment experiments. PKU-SafeRLHF (Dai et al., 2024) consists of conversation comparisons. Each comparison is annotated with two labels: a preference label indicating the human's choice between two responses, and a harmlessness label on the preferred response confirming whether it complies with safety standards.
Dataset Splits | Yes | HH-RLHF (Bai et al., 2022a): The dataset includes a helpfulness subset and a harmlessness subset. For each subset, we filter 10,000 samples for training and 2,000 samples for testing, then split the training set into two halves, used for weak teacher initialization and for the weak-to-strong alignment experiments, respectively. PKU-SafeRLHF (Dai et al., 2024): We filter 10,000 samples for training and 1,000 samples for testing, and likewise split the training set into two halves for weak teacher initialization and weak-to-strong alignment experiments.
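The described split (filter a fixed-size training pool, then halve it into a weak-teacher-initialization set and a weak-to-strong set) can be sketched as below. This is a minimal illustration, not the paper's actual preprocessing code; the sample structure and seed are assumptions.

```python
import random

def split_for_weak_to_strong(samples, seed=42):
    """Shuffle a filtered training pool and split it into two equal
    halves: one for weak-teacher initialization, one for the
    weak-to-strong alignment experiments."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

# e.g. a pool of 10,000 filtered HH-RLHF training samples
train_pool = [{"id": i} for i in range(10_000)]
teacher_init, w2s_align = split_for_weak_to_strong(train_pool)
print(len(teacher_init), len(w2s_align))  # 5000 5000
```

Shuffling before splitting avoids any ordering bias in the released data leaking into one half.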
Hardware Specification | Yes | All experiments are conducted on 8 A100 GPUs (80 GB each).
Software Dependencies | No | The paper mentions the AdamW optimizer, PEFT, LLaMA-Factory, and LoRA, and cites their respective papers, but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | During the training phase, weak teachers and strong students are initialized with SFT for 3 epochs, and these models are then trained with DPO for 1 epoch at each iteration. Moreover, we use the AdamW optimizer (Loshchilov & Hutter, 2019) with initial learning rates of 5 x 10^-5 for SFT and 1 x 10^-5 for DPO. The batch sizes are 32 for SFT and 16 for DPO. The scalar weighting hyperparameter γ is set to 0.2.
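Each iteration's DPO step optimizes the standard DPO objective over preference pairs; the sketch below shows that per-pair loss (Rafailov et al., 2023). It is a generic illustration, not MACPO's full multi-agent contrastive objective (which Algorithm 1 in the paper specifies, with γ = 0.2 weighting its contrastive terms); the β value here is an assumption.

```python
import math

def dpo_pair_loss(logp_w_policy, logp_l_policy,
                  logp_w_ref, logp_l_ref, beta=0.1):
    """DPO loss for one preference pair:
    -log sigmoid(beta * [(logpi(y_w) - logpref(y_w))
                         - (logpi(y_l) - logpref(y_l))]).
    Inputs are sequence log-probabilities of the chosen (w)
    and rejected (l) responses under the policy and the
    frozen reference model."""
    margin = beta * ((logp_w_policy - logp_w_ref)
                     - (logp_l_policy - logp_l_ref))
    # -log(sigmoid(margin)), written in a numerically direct form
    return math.log1p(math.exp(-margin))

# Policy prefers the chosen response more than the reference does,
# so the margin is positive and the loss drops below log(2).
loss = dpo_pair_loss(-1.0, -2.0, -1.2, -1.8)
print(loss)
```

The loss is minimized by widening the policy's chosen-vs-rejected log-probability gap relative to the reference model, which is what each DPO epoch in the setup above drives toward.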