Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization
Authors: Wenkai Yang, Shiqi Shen, Guangyao Shen, Wei Yao, Yong Liu, Gong Zhi, Yankai Lin, Ji-Rong Wen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments in both the reward modeling and preference optimization scenarios, we find: (1) The weak-to-strong deception phenomenon exists across all settings. (2) The deception intensifies as the capability gap between weak and strong models increases. (3) Bootstrapping with an intermediate model can mitigate the deception to some extent, though its effectiveness remains limited. Our work highlights the urgent need to pay more attention to the true reliability of superalignment. |
| Researcher Affiliation | Collaboration | 1. Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China; 2. WeChat, Tencent Inc., Beijing, China |
| Pseudocode | No | The paper describes mathematical forms of objectives and loss functions, but does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/RUCBM/weak-to-strong-deception. |
| Open Datasets | Yes | We set the target alignment goal to let the weak model teach the strong model to be harmless. For this goal, we choose a popular single-turn harmless dataset CAI-Harmless (Bai et al., 2022b), which is an improved version of HH-RLHF (Bai et al., 2022a)... We introduce an extra 4K ground truth helpful samples Dhelpful from HH-RLHF (Bai et al., 2022a)... We select and filter the honesty data from UnknownBench (Liu et al., 2023). |
| Dataset Splits | Yes | We then randomly split the entire dataset into three parts: (1) Dgt: 4K ground truth samples for fine-tuning weak and strong base language models to get θgt w and θgt s . (2) Dweak: A held-out set of 4K samples in which data labels are predicted by the weak model and are used to weakly supervise the strong model. (3) Dtest: The last 4K testing samples for evaluating the generalization performance of all models and probing the deception phenomenon. |
| Hardware Specification | Yes | All experiments are conducted on 4 * NVIDIA A40 (40G) and 8 * NVIDIA A800 (80G). |
| Software Dependencies | No | The paper mentions using the Adam optimizer and that the code is based on another open-source code, but does not provide specific version numbers for any key software components like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | When fine-tuning both ground truth and weak-to-strong models, for each experiment, the batch size is 32, the learning rate is 1e-5, and the max sequence length is set to 512. We use the Adam (Kingma & Ba, 2015) optimizer in all experiments. The training epoch for all experiments is set to 1, in order to avoid over-fitting, following Burns et al. (2024). ... In both SimPO and DPO settings, for each experiment, the batch size is 32, the learning rate is 1e-6, and the max sequence length is set to 512... the epoch of SFT is 1 for both methods. ... The number of epochs for preference optimization is 1 for SimPO, and 3 for DPO... For SimPO, the scaling factor β is fixed to 2.0 and the target reward margin γ is set to 1.0... For DPO, the scaling factor β is fixed to 0.1. |
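The three-way split quoted under "Dataset Splits" (Dgt / Dweak / Dtest, 4K samples each) can be sketched as below. This is an illustrative reconstruction, not the authors' code; the function name, seed, and fixed 4K sizes are assumptions based on the quoted description.

```python
import random

def three_way_split(samples, seed=0):
    """Randomly partition a dataset into the three 4K-sample parts the
    paper describes: D_gt (ground-truth fine-tuning of weak/strong base
    models), D_weak (held-out set whose labels the weak model predicts),
    and D_test (evaluation / deception probing).

    Illustrative sketch only; split sizes follow the quoted text.
    """
    samples = list(samples)
    assert len(samples) >= 12_000, "paper uses 4K + 4K + 4K samples"
    rng = random.Random(seed)          # fixed seed for reproducibility
    rng.shuffle(samples)               # random split, per the paper
    d_gt = samples[:4_000]
    d_weak = samples[4_000:8_000]
    d_test = samples[8_000:12_000]
    return d_gt, d_weak, d_test
```

A fixed seed is used so the partition is reproducible across runs; the paper itself only states that the split is random.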
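The "Experiment Setup" row fixes the preference-optimization hyperparameters (SimPO: β = 2.0, γ = 1.0; DPO: β = 0.1). A minimal per-pair loss sketch, assuming the standard SimPO and DPO formulations rather than the authors' exact implementation (which operates on batched token-level log-probabilities), might look like:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def simpo_loss(logp_w, logp_l, len_w, len_l, beta=2.0, gamma=1.0):
    """SimPO pairwise loss for one (chosen, rejected) pair: negative
    log-sigmoid of the length-normalized log-likelihood margin, minus
    the target reward margin gamma. Defaults match the paper's settings.
    """
    margin = beta * (logp_w / len_w - logp_l / len_l) - gamma
    return -math.log(sigmoid(margin))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO pairwise loss: negative log-sigmoid of the scaled difference
    of policy-vs-reference log-ratios. Default beta matches the paper.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(sigmoid(margin))
```

Note the asymmetry the quoted settings imply: SimPO needs no reference model but uses a much larger β and an explicit margin γ, while DPO compares the policy against a frozen reference with a small β.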