Perception-Guided Jailbreak Against Text-to-Image Models

Authors: Yihao Huang, Le Liang, Tianlin Li, Xiaojun Jia, Run Wang, Weikai Miao, Geguang Pu, Yang Liu

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "The experiments conducted on six open-source models and commercial online services with thousands of prompts have verified the effectiveness of PGJ."
Researcher Affiliation | Collaboration | 1 Nanyang Technological University, Singapore; 2 East China Normal University, China; 3 Wuhan University, China; 4 Key Laboratory of Cyberspace Security, Ministry of Education, China; 5 Shanghai Trusted Industrial Control Platform Co., Ltd., China
Pseudocode | No | The paper describes the steps for "Unsafe word selection" and "Word substitution" using LLM instructions in prose with examples, but does not present a formally labeled pseudocode or algorithm block.
Open Source Code | No | The paper does not contain an explicit statement offering access to the source code for the proposed PGJ method, nor does it provide a link to a code repository.
Open Datasets | No | "Thus we exploit GPT4 to generate a dataset with 1,000 prompts for five classical NSFW types: discrimination, illegal, pornographic, privacy, and violent." The paper does not provide a link, DOI, or specific repository name for accessing this generated dataset.
Dataset Splits | Yes | "We select 20 prompts for each NSFW type, a total of 100 prompts. ... Each NSFW type is represented by 200 prompts."
Hardware Specification | Yes | "All the experiments were run on an Ubuntu system with an NVIDIA A6000 Tensor Core GPU of 48G RAM."
Software Dependencies | No | The paper does not mention any specific software dependencies or libraries with their version numbers.
Experiment Setup | Yes | "Victim T2I Models. We adopt six popular T2I models as the victims of our attack. They are DALL·E 2 (OpenAI 2021), DALL·E 3 (OpenAI 2023a), Cogview3 (Zhipu 2024), SDXL (Podell et al. 2023), Tongyiwanxiang (Ali 2023b), and Hunyuan (Tencent 2024). ... Datasets. ... We exploit GPT4 to generate a dataset with 1,000 prompts for five classical NSFW types: discrimination, illegal, pornographic, privacy, and violent. ... Baselines. ... Evaluation metrics. We use four metrics to evaluate the experiment. ❶ We use the attack success rate (ASR) metric... ❷ We use the semantic consistency (SC) metric... ❸ We use prompt perplexity (PPL) as a metric... ❹ We use the Inception Score (IS)..."
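The Pseudocode row above notes that the attack's two steps ("Unsafe word selection" and "Word substitution") are described only in prose. The flow could be sketched roughly as below; this is a hypothetical illustration, not the authors' code, and the `ask_llm` helper is a canned stub standing in for a real LLM call so the example runs offline.

```python
# Rough sketch of the two-step procedure described in the review above.
# `ask_llm` is a hypothetical helper, stubbed with canned answers here.

def ask_llm(instruction: str) -> str:
    """Stub LLM: returns fixed answers for this demo prompt only."""
    canned = {
        "select": "blood",                     # step 1: unsafe word found
        "substitute": "red watercolor paint",  # step 2: perceptually similar safe phrase
    }
    key = "select" if "identify" in instruction.lower() else "substitute"
    return canned[key]

def perception_guided_rewrite(prompt: str) -> str:
    # Step 1: ask the LLM to identify an unsafe word in the prompt.
    unsafe_word = ask_llm(f"Identify the unsafe word in: {prompt}")
    # Step 2: ask for a safe phrase that is visually similar to it.
    safe_phrase = ask_llm(
        f"Give a safe phrase that is visually similar to '{unsafe_word}'."
    )
    return prompt.replace(unsafe_word, safe_phrase)

print(perception_guided_rewrite("a floor covered in blood"))
# -> a floor covered in red watercolor paint
```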
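Two of the metrics quoted in the Experiment Setup row have standard closed forms. A minimal sketch (not the paper's evaluation code), assuming ASR is the fraction of adversarial prompts judged successful and PPL is computed from per-token log-probabilities:

```python
import math

def attack_success_rate(successes: list[bool]) -> float:
    """ASR: fraction of adversarial prompts that produced the target image."""
    return sum(successes) / len(successes)

def perplexity(token_log_probs: list[float]) -> float:
    """PPL = exp(-mean log p(token)); lower means more natural-looking text."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

print(attack_success_rate([True, True, False, True]))  # 0.75
print(round(perplexity([math.log(0.25)] * 8), 2))      # 4.0
```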