Emoji Attack: Enhancing Jailbreak Attacks Against Judge LLM Detection
Authors: Zhipeng Wei, Yuqi Liu, N. Benjamin Erichson
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through experiments on state-of-the-art Judge LLMs, we demonstrate that Emoji Attack substantially reduces the unsafe prediction rate, bypassing existing safeguards. In this section, we present a comprehensive evaluation of our proposed Emoji Attack and token segmentation bias strategies against various Judge LLMs. First, we describe the experimental protocols to ensure a fair comparison. We then demonstrate how our proposed Emoji Attack improves jailbreak attacks against Judge LLM detection. |
| Researcher Affiliation | Academia | 1International Computer Science Institute, CA, USA 2UC Berkeley, CA, USA 3Lawrence Berkeley National Laboratory, CA, USA. Correspondence to: Zhipeng Wei <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Position Selection for cs-split. Input: a token x_i = x_i^1, …, x_i^D and an embedding function Emb(·) from a surrogate model. Output: the modified token x̂_{i,j*}. 1: Initialize S ← {}. 2: for j = 1 to D−1 do 3: compute s_j using Equation 3; 4: append s_j to S; 5: end for. 6: Identify j* := arg min_j {s_j}. 7: return x̂_{i,j*} = x_i^1, …, x_i^{j*−1} ∥ x_i^{j*}, …, x_i^D (the token split at position j*). |
| Open Source Code | Yes | We provide research code to reproduce our results on GitHub: https://github.com/zhipeng-wei/EmojiAttack. |
| Open Datasets | Yes | We used a dataset of 402 short offensive phrases sourced from a publicly available list. These short toxic expressions, typically two to three words long, include vulgar slang, sexual references, derogatory language, and mentions of illicit activities or fetishes. Example entries are shown in Table 1. ... Specifically, we sample 574 harmful responses from AdvBench (Zou et al., 2023), which span various categories such as profanity and graphic content (ranging from 3 to 44 words). We also include 858 jailbreak-generated responses: 110 from LLM Self Defense (Phute et al., 2024) and 748 from Red Teaming Attempts (Ganguli et al., 2022). |
| Dataset Splits | No | The paper describes using collected datasets for evaluation (e.g., "Using this dataset, we evaluate whether fjudge correctly classifies them as unsafe"), but it does not specify explicit training, validation, or test splits for these datasets within the context of the experiments conducted in the paper. |
| Hardware Specification | No | We acknowledge the U.S. Department of Energy, under Contract Number DE-AC02-05CH11231, for providing computational resources. We used the computational clusters provided by NERSC and LBNL's Lawrencium. No specific hardware models (GPU/CPU/memory) are mentioned. |
| Software Dependencies | No | The paper mentions LLMs like Llama Guard, GPT-3.5, Gemini, Claude, and a surrogate model gtr-t5-xl, but does not list specific software libraries or frameworks with their version numbers that were used for the implementation of the Emoji Attack itself. |
| Experiment Setup | No | The paper describes the evaluation metrics (unsafe prediction ratio) and how harmfulness scores are determined for commercial LLMs based on existing literature. It also details the attack setting by combining Emoji Attack with existing jailbreak techniques. However, it does not provide specific hyperparameters for any training processes (e.g., learning rates, batch sizes, optimizers) or system-level settings for their own methodology (e.g., specific temperature or top-p settings if using LLMs for generating attack variations, beyond the general instruction). |
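The position-selection pseudocode in the table above (Algorithm 1) can be sketched in a few lines of Python. This is a hedged, self-contained illustration, not the authors' implementation: `emb` below is a toy character-bigram embedding standing in for the paper's surrogate encoder (gtr-t5-xl), and we assume Equation 3 scores each candidate split by the embedding similarity between the split token and the original, taking the arg-min as the insertion point.

```python
import math


def emb(text):
    # Toy stand-in for Emb(.) from a surrogate model (e.g. gtr-t5-xl):
    # hashed character-bigram counts in a fixed-size vector.
    vec = [0.0] * 64
    for a, b in zip(text, text[1:]):
        vec[(ord(a) * 31 + ord(b)) % 64] += 1.0
    return vec


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def select_split_position(token, delimiter="\U0001F600"):
    """Algorithm 1 sketch: pick the interior split position j* that
    minimizes the similarity score s_j (assumed form of Equation 3),
    then insert the delimiter (here an emoji) at that position."""
    original = emb(token)
    scores = {}
    for j in range(1, len(token)):  # j = 1 .. D-1
        split = token[:j] + delimiter + token[j:]
        scores[j] = cosine(emb(split), original)
    j_star = min(scores, key=scores.get)  # arg min_j {s_j}
    return token[:j_star] + delimiter + token[j_star:], j_star
```

Under these assumptions, `select_split_position("harmful")` returns the token with an emoji inserted at the position where the split most disrupts the surrogate embedding, mirroring the per-token step of the Emoji Attack.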