Emoji Attack: Enhancing Jailbreak Attacks Against Judge LLM Detection
Authors: Zhipeng Wei, Yuqi Liu, N. Benjamin Erichson
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through experiments on state-of-the-art Judge LLMs, we demonstrate that Emoji Attack substantially reduces the unsafe prediction rate, bypassing existing safeguards. In this section, we present a comprehensive evaluation of our proposed Emoji Attack and token segmentation bias strategies against various Judge LLMs. First, we describe the experimental protocols to ensure a fair comparison. We then demonstrate how our proposed Emoji Attack improves jailbreak attacks against Judge LLM detection. |
| Researcher Affiliation | Academia | 1International Computer Science Institute, CA, USA 2UC Berkeley, CA, USA 3Lawrence Berkeley National Laboratory, CA, USA. Correspondence to: Zhipeng Wei <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Position Selection for cs-split. Input: a token x_i = x_i^1, …, x_i^D and an embedding function Emb(·) from a surrogate model. Output: the modified token x̂_{i,j*}. 1: Initialize S ← {}. 2: for j = 1 to D−1 do 3: compute s_j using Equation 3; 4: append s_j to S; 5: end for. 6: Identify j* := arg min_j {s_j}. 7: return x̂_{i,j*} = x_i^1, …, x_i^{j*−1} ∥ x_i^{j*}, …, x_i^D (the token split at position j*). |
| Open Source Code | Yes | We provide research code to reproduce our results on GitHub: https://github.com/zhipeng-wei/EmojiAttack. |
| Open Datasets | Yes | We used a dataset of 402 short offensive phrases sourced from a publicly available list. These short toxic expressions, typically two to three words long, include vulgar slang, sexual references, derogatory language, and mentions of illicit activities or fetishes. Example entries are shown in Table 1. ... Specifically, we sample 574 harmful responses from AdvBench (Zou et al., 2023), which span various categories such as profanity and graphic content (ranging from 3 to 44 words). We also include 858 jailbreak-generated responses: 110 from LLM Self Defense (Phute et al., 2024) and 748 from Red Teaming Attempts (Ganguli et al., 2022). |
| Dataset Splits | No | The paper describes using collected datasets for evaluation (e.g., "Using this dataset, we evaluate whether fjudge correctly classifies them as unsafe"), but it does not specify explicit training, validation, or test splits for these datasets within the context of the experiments conducted in the paper. |
| Hardware Specification | No | We acknowledge the U.S. Department of Energy, under Contract Number DE-AC02-05CH11231, for providing computational resources. We used the computational clusters provided by NERSC and LBNL's Lawrencium. No specific hardware models (GPU/CPU/memory) are mentioned. |
| Software Dependencies | No | The paper mentions LLMs like Llama Guard, GPT-3.5, Gemini, Claude, and a surrogate model gtr-t5-xl, but does not list specific software libraries or frameworks with their version numbers that were used for the implementation of the Emoji Attack itself. |
| Experiment Setup | No | The paper describes the evaluation metrics (unsafe prediction ratio) and how harmfulness scores are determined for commercial LLMs based on existing literature. It also details the attack setting by combining Emoji Attack with existing jailbreak techniques. However, it does not provide specific hyperparameters for any training processes (e.g., learning rates, batch sizes, optimizers) or system-level settings for their own methodology (e.g., specific temperature or top-p settings if using LLMs for generating attack variations, beyond the general instruction). |
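The position-selection pseudocode in the table above (Algorithm 1) can be sketched in a few lines of Python. This is a hedged, self-contained illustration, not the authors' implementation: `emb` below is a toy character-bigram embedding standing in for the paper's surrogate encoder (gtr-t5-xl), and we assume Equation 3 scores each candidate split by the embedding similarity between the split token and the original, taking the arg-min as the insertion point.

```python
import math


def emb(text):
    # Toy stand-in for Emb(.) from a surrogate model (e.g. gtr-t5-xl):
    # hashed character-bigram counts in a fixed-size vector.
    vec = [0.0] * 64
    for a, b in zip(text, text[1:]):
        vec[(ord(a) * 31 + ord(b)) % 64] += 1.0
    return vec


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def select_split_position(token, delimiter="\U0001F600"):
    """Algorithm 1 sketch: pick the interior split position j* that
    minimizes the similarity score s_j (assumed form of Equation 3),
    then insert the delimiter (here an emoji) at that position."""
    original = emb(token)
    scores = {}
    for j in range(1, len(token)):  # j = 1 .. D-1
        split = token[:j] + delimiter + token[j:]
        scores[j] = cosine(emb(split), original)
    j_star = min(scores, key=scores.get)  # arg min_j {s_j}
    return token[:j_star] + delimiter + token[j_star:], j_star
```

Under these assumptions, `select_split_position("harmful")` returns the token with an emoji inserted at the position where the split most disrupts the surrogate embedding, mirroring the per-token step of the Emoji Attack.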