Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization
Authors: Wenkai Yang, Shiqi Shen, Guangyao Shen, Wei Yao, Yong Liu, Gong Zhi, Yankai Lin, Ji-Rong Wen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments in both the reward modeling and preference optimization scenarios, we find: (1) The weak-to-strong deception phenomenon exists across all settings. (2) The deception intensifies as the capability gap between weak and strong models increases. (3) Bootstrapping with an intermediate model can mitigate the deception to some extent, though its effectiveness remains limited. Our work highlights the urgent need to pay more attention to the true reliability of superalignment. |
| Researcher Affiliation | Collaboration | 1. Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China; 2. WeChat, Tencent Inc., Beijing, China |
| Pseudocode | No | The paper describes mathematical forms of objectives and loss functions, but does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/RUCBM/weak-to-strong-deception. |
| Open Datasets | Yes | We set the target alignment goal to let the weak model teach the strong model to be harmless. For this goal, we choose a popular single-turn harmless dataset CAI-Harmless (Bai et al., 2022b), which is an improved version of HH-RLHF (Bai et al., 2022a)... We introduce an extra 4K ground truth helpful samples Dhelpful from HH-RLHF (Bai et al., 2022a)... We select and filter the honesty data from UnknownBench (Liu et al., 2023). |
| Dataset Splits | Yes | We then randomly split the entire dataset into three parts: (1) Dgt: 4K ground truth samples for fine-tuning weak and strong base language models to get θgt w and θgt s . (2) Dweak: A held-out set of 4K samples in which data labels are predicted by the weak model and are used to weakly supervise the strong model. (3) Dtest: The last 4K testing samples for evaluating the generalization performance of all models and probing the deception phenomenon. |
| Hardware Specification | Yes | All experiments are conducted on 4 * NVIDIA A40 (40G) and 8 * NVIDIA A800 (80G). |
| Software Dependencies | No | The paper mentions using the Adam optimizer and that the code is based on another open-source code, but does not provide specific version numbers for any key software components like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | When fine-tuning both ground truth and weak-to-strong models, for each experiment, the batch size is 32, the learning rate is 1e-5, and the max sequence length is set to 512. We use the Adam (Kingma & Ba, 2015) optimizer in all experiments. The training epoch for all experiments is set to 1, in order to avoid over-fitting, following Burns et al. (2024). ... In both SimPO and DPO settings, for each experiment, the batch size is 32, the learning rate is 1e-6, and the max sequence length is set to 512... the epoch of SFT is 1 for both methods. ... The number of epochs for preference optimization is 1 for SimPO, and 3 for DPO... For SimPO, the scaling factor β is fixed to 2.0 and the target reward margin γ is set to 1.0... For DPO, the scaling factor β is fixed to 0.1. |
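The three-way split quoted under "Dataset Splits" (Dgt / Dweak / Dtest, 4K samples each) can be sketched as below. This is an illustrative reconstruction, not the authors' code; the function name, seed, and fixed 4K sizes are assumptions based on the quoted description.

```python
import random

def three_way_split(samples, seed=0):
    """Randomly partition a dataset into the three 4K-sample parts the
    paper describes: D_gt (ground-truth fine-tuning of weak/strong base
    models), D_weak (held-out set whose labels the weak model predicts),
    and D_test (evaluation / deception probing).

    Illustrative sketch only; split sizes follow the quoted text.
    """
    samples = list(samples)
    assert len(samples) >= 12_000, "paper uses 4K + 4K + 4K samples"
    rng = random.Random(seed)          # fixed seed for reproducibility
    rng.shuffle(samples)               # random split, per the paper
    d_gt = samples[:4_000]
    d_weak = samples[4_000:8_000]
    d_test = samples[8_000:12_000]
    return d_gt, d_weak, d_test
```

A fixed seed is used so the partition is reproducible across runs; the paper itself only states that the split is random.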
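The "Experiment Setup" row fixes the preference-optimization hyperparameters (SimPO: β = 2.0, γ = 1.0; DPO: β = 0.1). A minimal per-pair loss sketch, assuming the standard SimPO and DPO formulations rather than the authors' exact implementation (which operates on batched token-level log-probabilities), might look like:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def simpo_loss(logp_w, logp_l, len_w, len_l, beta=2.0, gamma=1.0):
    """SimPO pairwise loss for one (chosen, rejected) pair: negative
    log-sigmoid of the length-normalized log-likelihood margin, minus
    the target reward margin gamma. Defaults match the paper's settings.
    """
    margin = beta * (logp_w / len_w - logp_l / len_l) - gamma
    return -math.log(sigmoid(margin))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO pairwise loss: negative log-sigmoid of the scaled difference
    of policy-vs-reference log-ratios. Default beta matches the paper.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(sigmoid(margin))
```

Note the asymmetry the quoted settings imply: SimPO needs no reference model but uses a much larger β and an explicit margin γ, while DPO compares the policy against a frozen reference with a small β.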