Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions

Authors: Yik Siu Chan, Narutatsu Ri, Yuxin Xiao, Marzyeh Ghassemi

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we demonstrate that LLM responses most effectively facilitate harmful actions when they are both actionable and informative: two attributes easily elicited in multi-step, multilingual interactions. Using this insight, we propose HARMSCORE, a jailbreak metric that measures how effectively an LLM response enables harmful actions, and SPEAK EASY, a simple multi-step, multilingual attack framework. Notably, by incorporating SPEAK EASY into direct request and jailbreak baselines, we see an average absolute increase of 0.319 in Attack Success Rate and 0.426 in HARMSCORE in both open-source and proprietary LLMs across four safety benchmarks. Our work reveals a critical yet often overlooked vulnerability: Malicious users can easily exploit common interaction patterns for harmful intentions. ... We summarize our contributions in this paper as follows: ... We show that SPEAK EASY, a simple multi-step and multilingual jailbreak framework, significantly increases the likelihood of harmful responses in both proprietary and open-source LLMs.
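The headline numbers (0.319 in Attack Success Rate, 0.426 in HARMSCORE) are average absolute increases over benchmarks. A minimal sketch of that arithmetic; the function name and the sample numbers are illustrative, not taken from the paper:

```python
def average_absolute_increase(baseline_scores, speak_easy_scores):
    """Mean additive increase of a metric across benchmarks.

    Both arguments are aligned per-benchmark score lists; the result is
    the average of (score with SPEAK EASY - baseline score).
    """
    deltas = [after - before
              for before, after in zip(baseline_scores, speak_easy_scores)]
    return sum(deltas) / len(deltas)
```

A per-benchmark breakdown would be needed to reproduce the paper's exact averages; this only shows how such a summary statistic is computed.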
Researcher Affiliation | Academia | 1Brown University, 2Columbia University, 3Massachusetts Institute of Technology. Correspondence to: Yik Siu Chan <yik siu EMAIL>.
Pseudocode | No | The paper describes the SPEAK EASY jailbreak framework with a conceptual diagram in Figure 3, but it does not provide any formal pseudocode or algorithm blocks with structured steps in a code-like format.
Open Source Code | Yes | Our code is available at https://github.com/yiksiu-chan/SpeakEasy.
Open Datasets | Yes | We evaluate on four jailbreak benchmarks, covering a wide range of harm categories: (1) HarmBench (Mazeika et al., 2024) with its standard split of 200 single-sentence queries; (2) AdvBench (Zou et al., 2023) with 520 harmful instructions; (3) SORRY-Bench (Xie et al., 2024) with 450 harmful instructions; (4) MedSafetyBench (Han et al., 2024), where we randomly sample 50 examples from each of the nine medical harm categories, totaling 450 instances. ... We preprocess the HH-RLHF (Bai et al., 2022a) and Stack-Exchange-Preferences (Lambert et al., 2023) datasets by filtering out irrelevant instances.
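The MedSafetyBench subset is built by drawing a fixed number of queries from each of nine harm categories (50 × 9 = 450). A hedged sketch of that stratified sampling step; the data layout and seed are assumptions, not the authors' code:

```python
import random

def sample_per_category(examples_by_category, per_category=50, seed=0):
    """Draw `per_category` examples from each category without replacement.

    `examples_by_category` maps a category name to its list of queries;
    categories are iterated in sorted order so results are reproducible
    for a given seed.
    """
    rng = random.Random(seed)
    sampled = []
    for category in sorted(examples_by_category):
        sampled.extend(rng.sample(examples_by_category[category], per_category))
    return sampled
```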
Dataset Splits | Yes | HarmBench (Mazeika et al., 2024) with its standard split of 200 single-sentence queries; (2) AdvBench (Zou et al., 2023) with 520 harmful instructions; (3) SORRY-Bench (Xie et al., 2024) with 450 harmful instructions; (4) MedSafetyBench (Han et al., 2024), where we randomly sample 50 examples from each of the nine medical harm categories, totaling 450 instances. ... This process yields a preference dataset comprising 27,000 valid query-preference pairs for each attribute... We construct preference test sets using the human evaluation data from Section 3.1. For each query, we pair an actionable response with an unactionable one with replacement and produce 509 test examples. We curate 455 examples for informativeness with the same procedure.
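The preference test sets pair an actionable response with an unactionable one per query, sampled with replacement. A minimal sketch of that pairing step, assuming a simple per-query data layout (not taken from the released code):

```python
import random

def build_preference_pairs(responses_by_query, seed=0):
    """For each query, pair one actionable response with one unactionable
    response, each drawn with replacement from that query's pools.

    `responses_by_query` maps a query to a tuple
    (actionable_responses, unactionable_responses); queries missing
    either pool are skipped.
    """
    rng = random.Random(seed)
    pairs = []
    for query, (actionable, unactionable) in responses_by_query.items():
        if actionable and unactionable:
            pairs.append((query, rng.choice(actionable), rng.choice(unactionable)))
    return pairs
```

Sampling with replacement means the same response can appear in several pairs, which matches the paper's description of producing 509 (actionability) and 455 (informativeness) test examples from a smaller pool of judged responses.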
Hardware Specification | Yes | We trained the model with a cosine scheduler, a warmup ratio of 0.03, and bf16 precision. DPO preference tuning was performed on one A100 GPU for both response selection models.
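A cosine schedule with a 0.03 warmup ratio can be sketched as follows; the peak learning rate and step counts are placeholders, not values from the paper:

```python
import math

def cosine_lr_with_warmup(step, total_steps, peak_lr, warmup_ratio=0.03):
    """Linear warmup for the first `warmup_ratio` of training, then
    cosine decay from `peak_lr` down to 0 over the remaining steps."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Libraries such as Hugging Face Transformers ship an equivalent schedule (`get_cosine_schedule_with_warmup`); the hand-rolled version above only illustrates the shape of the curve.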
Software Dependencies | No | The paper mentions using specific LLMs like 'Llama-3.1-8B-Instruct' and services like 'Azure AI Translator (Azure, 2024)', but does not provide specific version numbers for underlying software libraries or frameworks such as Python, PyTorch, or CUDA, which are necessary for full reproducibility.
Experiment Setup | Yes | For SPEAK EASY, we use three steps (m = 3) for query decomposition and six languages (n = 6) to exploit multilingual vulnerabilities. ... We followed the implementation by Dong et al. (2023) and used a learning rate of 2 × 10^-6 with a linear decay rate of 0.999 over 8 epochs and a batch size of 64. ... We set max tokens to 256. ... For training, we use the Vicuna-7B and Vicuna-13B models (Chiang et al., 2023) and randomly sample 25 harmful queries from the benchmark dataset. The suffix yielding the lowest loss after 100 optimization steps is selected.
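The reported settings (learning rate 2 × 10^-6, decay rate 0.999, 8 epochs, batch size 64, 256 max tokens) can be collected into a config sketch. Reading the "decay rate of 0.999" as a multiplicative per-step factor is our assumption; the key names below are illustrative, not the authors' actual config schema:

```python
# Hyperparameters as reported in the paper; key names are illustrative.
TRAINING_CONFIG = {
    "learning_rate": 2e-6,
    "lr_decay": 0.999,        # assumed multiplicative per-step factor
    "epochs": 8,
    "batch_size": 64,
    "max_new_tokens": 256,
}

def decayed_lr(step, lr0=2e-6, decay=0.999):
    """Learning rate after `step` applications of the decay factor
    (our reading of the reported decay rate of 0.999)."""
    return lr0 * decay ** step
```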