Weak-to-Strong Jailbreaking on Large Language Models
Authors: Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, William Yang Wang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the weak-to-strong attack on 5 diverse open-source LLMs from 3 organizations. Our experiments on five LLMs show that the weak-to-strong attack outperforms prior methods, achieving over 99% attack success rates on two datasets with just one forward pass per example. |
| Researcher Affiliation | Collaboration | 1UC Berkeley 2UC Santa Barbara 3Sea AI Lab, Singapore 4Carnegie Mellon University 5UC San Diego. |
| Pseudocode | No | The paper describes the method using mathematical formulas such as M+(y_t \| q, y_{<t}) = ... and prose, but does not include a figure, block, or section explicitly labeled "Pseudocode" or "Algorithm", nor structured steps formatted like code or an algorithm. |
| Open Source Code | Yes | The code for replicating the method is available at https://github.com/XuandongZhao/weak-to-strong. |
| Open Datasets | Yes | In the experiment, we use two benchmark datasets, AdvBench (Zou et al., 2023) and MaliciousInstruct (Huang et al., 2023), to evaluate the effectiveness of the weak-to-strong attack. ... For malicious questions, we use the AdvBench dataset from Zou et al. (2023), and for general questions, we use the open question-answering dataset at https://huggingface.co/datasets/argilla/databricks-dolly-15k-curated-en. ... we expanded our evaluation to include two new safety benchmarks: SALAD-Bench (Li et al., 2024) and SORRY-Bench (Xie et al., 2024). |
| Dataset Splits | No | For SALAD-Bench, we sampled 5 data points from each of the 66 categories in the base set, for a total of 330 samples. For SORRY-Bench, we used 450 samples covering 45 categories. ... For each experiment, we use 100 adversarial examples from the released dataset of Yang et al. (2023), which has no data overlap with the AdvBench or MaliciousInstruct datasets. The paper describes sampling strategies and the number of examples used for fine-tuning, but does not provide explicit training/validation/test splits for the main datasets (AdvBench, MaliciousInstruct) used in the paper's primary evaluation. |
| Hardware Specification | Yes | All experiments are conducted using 4 A100 80G and 8 A100 40G GPUs. |
| Software Dependencies | No | We utilize the Stanford Alpaca training system. ... We use sentence_bleu from nltk.translate.bleu_score for BLEU scores, rouge_scorer for ROUGE scores, and all-MiniLM-L6-v2 from sentence-transformers for sentence similarity. The paper names software tools and libraries but does not provide specific version numbers for them (e.g., Stanford Alpaca, nltk, rouge_scorer, sentence-transformers). |
| Experiment Setup | Yes | The learning rate is set at 2e-5, with a per-device batch size of 8 and a gradient accumulation step of 1. The maximum text length is 1,024, with a total of 15 training epochs. Additionally, we set the warm-up ratio to 0.03 and employ Fully Sharded Data Parallel (FSDP) for all computational tasks. For generation, we adhere to the fixed default settings with a temperature of 0.1 and a Top-p value of 0.9. ... with an amplification factor of α = 1.5, ... The α value is set to 1.0 for both settings. |
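The amplification rule quoted above (a next-token distribution M+(y_t \| q, y_{<t}) adjusted by a factor α) can be sketched as follows. This is a minimal illustration based only on the excerpts in this report, not the authors' released code: the function name `weak_to_strong_probs` and the exact combination rule, amplifying the log-probability gap between an unsafe and a safe weak model and applying it to the strong model's logits, are assumptions.

```python
import numpy as np

def weak_to_strong_probs(strong_logits, weak_safe_logits,
                         weak_unsafe_logits, alpha=1.5):
    """Sketch of weak-to-strong decoding for one step.

    All inputs are 1-D logit arrays over the same vocabulary.
    The safe strong model's log-probs are shifted by alpha times
    the (unsafe - safe) log-prob gap of the weak model pair, then
    renormalized into a distribution.
    """
    def log_softmax(x):
        x = x - np.max(x)                      # stabilize exponentials
        return x - np.log(np.sum(np.exp(x)))

    log_p = (log_softmax(strong_logits)
             + alpha * (log_softmax(weak_unsafe_logits)
                        - log_softmax(weak_safe_logits)))
    p = np.exp(log_p - np.max(log_p))          # renormalize
    return p / p.sum()
```

With α = 0 this reduces to the strong model's own distribution; larger α (e.g. the 1.5 used in the paper's setting) pushes more probability mass toward tokens the unsafe weak model prefers over the safe one.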