Weak-to-Strong Jailbreaking on Large Language Models
Authors: Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, William Yang Wang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the weak-to-strong attack on 5 diverse open-source LLMs from 3 organizations. Our experiments on five LLMs show that the weak-to-strong attack outperforms prior methods, achieving over 99% attack success rates on two datasets with just one forward pass per example. |
| Researcher Affiliation | Collaboration | 1UC Berkeley 2UC Santa Barbara 3Sea AI Lab, Singapore 4Carnegie Mellon University 5UC San Diego. |
| Pseudocode | No | The paper describes the method using mathematical formulas such as M+(y_t \| q, y_{<t}) = ... and prose, but does not include a figure, block, or section explicitly labeled "Pseudocode" or "Algorithm", nor structured steps formatted like code or an algorithm. |
| Open Source Code | Yes | The code for replicating the method is available at https://github.com/XuandongZhao/weak-to-strong. |
| Open Datasets | Yes | In the experiment, we use two benchmark datasets, AdvBench (Zou et al., 2023) and MaliciousInstruct (Huang et al., 2023), to evaluate the effectiveness of the weak-to-strong attack. ... For malicious questions, we use the AdvBench dataset from Zou et al. (2023), and for general questions, we use the open question-answering dataset at https://huggingface.co/datasets/argilla/databricks-dolly-15k-curated-en. ... we expanded our evaluation to include two new safety benchmarks: SALAD-Bench (Li et al., 2024) and SORRY-Bench (Xie et al., 2024). |
| Dataset Splits | No | For SALAD-Bench, we sampled 5 data points from each of the 66 categories in the base set, for a total of 330 samples. For SORRY-Bench, we used 450 samples covering 45 categories. ... For each experiment, we use 100 adversarial examples from the released dataset of Yang et al. (2023), which has no data overlap with the AdvBench or MaliciousInstruct datasets. The paper describes sampling strategies and the number of examples used for fine-tuning, but does not provide explicit training/validation/test splits for the main datasets (AdvBench, MaliciousInstruct) used in the paper's primary evaluation. |
| Hardware Specification | Yes | All experiments are conducted using 4 A100 80G and 8 A100 40G GPUs. |
| Software Dependencies | No | We utilize the Stanford Alpaca training system. ... We use sentence_bleu from nltk.translate.bleu_score for BLEU scores, rouge_scorer for ROUGE scores, and all-MiniLM-L6-v2 from sentence-transformers for sentence similarity. The paper names software tools and libraries but does not provide specific version numbers for them (e.g., Stanford Alpaca, nltk, rouge_scorer, sentence-transformers). |
| Experiment Setup | Yes | The learning rate is set at 2e-5, with a per-device batch size of 8 and a gradient accumulation step of 1. The maximum text length is 1,024, with a total of 15 training epochs. Additionally, we set the warm-up ratio to 0.03 and employ Fully Sharded Data Parallel (FSDP) for all computational tasks. For generation, we adhere to the fixed default settings with a temperature of 0.1 and a Top-p value of 0.9. ... with an amplification factor of α = 1.5, ... The α value is set to 1.0 for both settings. |
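The amplification rule quoted above (a next-token distribution M+(y_t \| q, y_{<t}) adjusted by a factor α) can be sketched as follows. This is a minimal illustration based only on the excerpts in this report, not the authors' released code: the function name `weak_to_strong_probs` and the exact combination rule, amplifying the log-probability gap between an unsafe and a safe weak model and applying it to the strong model's logits, are assumptions.

```python
import numpy as np

def weak_to_strong_probs(strong_logits, weak_safe_logits,
                         weak_unsafe_logits, alpha=1.5):
    """Sketch of weak-to-strong decoding for one step.

    All inputs are 1-D logit arrays over the same vocabulary.
    The safe strong model's log-probs are shifted by alpha times
    the (unsafe - safe) log-prob gap of the weak model pair, then
    renormalized into a distribution.
    """
    def log_softmax(x):
        x = x - np.max(x)                      # stabilize exponentials
        return x - np.log(np.sum(np.exp(x)))

    log_p = (log_softmax(strong_logits)
             + alpha * (log_softmax(weak_unsafe_logits)
                        - log_softmax(weak_safe_logits)))
    p = np.exp(log_p - np.max(log_p))          # renormalize
    return p / p.sum()
```

With α = 0 this reduces to the strong model's own distribution; larger α (e.g. the 1.5 used in the paper's setting) pushes more probability mass toward tokens the unsafe weak model prefers over the safe one.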