Perception-Guided Jailbreak Against Text-to-Image Models
Authors: Yihao Huang, Le Liang, Tianlin Li, Xiaojun Jia, Run Wang, Weikai Miao, Geguang Pu, Yang Liu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experiments conducted on six open-source models and commercial online services with thousands of prompts have verified the effectiveness of PGJ. |
| Researcher Affiliation | Collaboration | ¹Nanyang Technological University, Singapore; ²East China Normal University, China; ³Wuhan University, China; ⁴Key Laboratory of Cyberspace Security, Ministry of Education, China; ⁵Shanghai Trusted Industrial Control Platform Co., Ltd., China |
| Pseudocode | No | The paper describes the steps for "Unsafe word selection" and "Word substitution" using LLM instructions in prose with examples, but does not present a formally labeled pseudocode or algorithm block. |
| Open Source Code | No | The paper does not contain an explicit statement offering access to the source code for the proposed PGJ method, nor does it provide a link to a code repository. |
| Open Datasets | No | Thus we exploit GPT4 to generate a dataset with 1,000 prompts for five classical NSFW types: discrimination, illegal, pornographic, privacy, and violent. The paper does not provide a link, DOI, or specific repository name for accessing this generated dataset. |
| Dataset Splits | Yes | We select 20 prompts for each NSFW type, a total of 100 prompts. ... Each NSFW type is represented by 200 prompts. |
| Hardware Specification | Yes | All the experiments were run on an Ubuntu system with an NVIDIA A6000 Tensor Core GPU of 48G RAM. |
| Software Dependencies | No | The paper does not mention any specific software dependencies or libraries with their version numbers. |
| Experiment Setup | Yes | Victim T2I Models. We adopt six popular T2I models as the victims of our attack. They are DALL·E 2 (OpenAI 2021), DALL·E 3 (OpenAI 2023a), Cogview3 (Zhipu 2024), SDXL (Podell et al. 2023), Tongyiwanxiang (Ali 2023b), and Hunyuan (Tencent 2024). ... Datasets. ... We exploit GPT4 to generate a dataset with 1,000 prompts for five classical NSFW types: discrimination, illegal, pornographic, privacy, and violent. ... Baselines. ... Evaluation metrics. We use four metrics to evaluate the experiment. ❶We use the attack success rate (ASR) metric... ❷We use the semantic consistency (SC) metric... ❸We use prompt perplexity (PPL) as a metric... ❹We use the Inception Score (IS)... |
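The Experiment Setup row names four evaluation metrics but the paper releases no code, so the exact implementations are unknown. As a minimal sketch of the two simplest metrics, attack success rate and mean semantic consistency, the following functions are illustrative assumptions (names and inputs are hypothetical, not the authors' code); ASR here is simply the fraction of adversarial prompts that produced an NSFW image, and SC is averaged from per-prompt image-similarity scores such as CLIP cosine similarities.

```python
# Hypothetical sketch of ASR and mean SC; function names and
# input conventions are assumptions, not the authors' released code.

def attack_success_rate(outcomes):
    """Fraction of adversarial prompts that bypassed the safety filter.

    outcomes: iterable of booleans, True when the T2I model produced
    an NSFW image for the substituted (perception-similar) prompt.
    """
    outcomes = list(outcomes)
    if not outcomes:
        return 0.0
    return sum(outcomes) / len(outcomes)


def mean_semantic_consistency(scores):
    """Average similarity between images from the original unsafe prompt
    and its safe substitute (e.g., per-pair CLIP cosine scores in [0, 1])."""
    scores = list(scores)
    if not scores:
        return 0.0
    return sum(scores) / len(scores)


if __name__ == "__main__":
    print(attack_success_rate([True, True, False, True]))  # 0.75
    print(mean_semantic_consistency([0.81, 0.77, 0.90]))
```

PPL and IS would additionally require a language model (for prompt perplexity) and an Inception network over generated images, so they are omitted from this sketch.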