AdvWave: Stealthy Adversarial Jailbreak Attack against Large Audio-Language Models

Authors: Mintong Kang, Chejian Xu, Bo Li

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive evaluations on multiple advanced ALMs demonstrate that AdvWave outperforms baseline methods, achieving a 40% higher average jailbreak attack success rate. Both audio-stealthiness metrics and human evaluations confirm that adversarial audio generated by AdvWave is indistinguishable from natural sounds.
Researcher Affiliation | Academia | Mintong Kang & Chejian Xu & Bo Li, University of Illinois at Urbana-Champaign, EMAIL
Pseudocode | No | The paper describes its methods through text, mathematical formulations (e.g., Equations 1-7), and high-level diagrams (Figure 1), but does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks.
Open Source Code | No | The paper does not contain an explicit statement about releasing source code, nor a link to a code repository for the described methodology.
Open Datasets | Yes | As AdvBench (Zou et al., 2023) is widely used for jailbreak evaluations in the text domain (Liu et al., 2023a; Chao et al., 2023; Mehrotra et al., 2023), we adapted its text-based queries into audio format using OpenAI's TTS APIs, creating the AdvBench-Audio dataset. AdvBench-Audio contains 520 audio queries, each requesting instructions on unethical or illegal activities.
Dataset Splits | No | AdvBench-Audio contains 520 audio queries, each requesting instructions on unethical or illegal activities. The paper does not provide specific training/validation/test splits for this dataset.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models or processor types) used for running its experiments.
Software Dependencies | No | The paper mentions using GPT-4o and OpenAI's TTS APIs as tools, and the Qwen2-Audio model for classification, but does not provide specific software dependencies with version numbers (e.g., libraries, frameworks, or programming-language versions) for its own implementation.
Experiment Setup | Yes | We implement the adversarial loss L_adv as the cross-entropy loss between ALM output likelihoods and the adaptively searched adversarial targets. We fix the slack margin α as 1.0 in the alignment loss L_align. We use the Qwen2-Audio model to implement the audio classifier that imposes the classifier guidance L_stealth, following the prompts in Appendix A.3. For AdvWave optimization, we set a maximum of 3000 epochs, with an early-stopping criterion if the loss falls below 0.1. We optimize the adversarial noise toward the sound of a car horn by default, but we also evaluate diverse environmental noises in Section 4.4.
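The optimization settings quoted in the Experiment Setup row can be sketched as a simple loop. This is a hypothetical illustration only: the quadratic losses below are toy surrogates for the paper's cross-entropy adversarial loss and Qwen2-Audio classifier guidance, and the learning rate and vector sizes are assumptions; only the 3000-epoch cap, the 0.1 early-stopping threshold, and the slack margin α = 1.0 come from the paper.

```python
import numpy as np

# Sketch of an AdvWave-style optimization loop. The real method optimizes
# an adversarial audio waveform against an ALM; here a toy quadratic loss
# stands in for L_adv, and a hinge with slack margin ALPHA stands in for
# the alignment loss L_align (classifier guidance L_stealth is omitted).

MAX_EPOCHS = 3000      # maximum optimization epochs (from the paper)
EARLY_STOP_LOSS = 0.1  # early-stopping threshold on the loss (from the paper)
ALPHA = 1.0            # slack margin in the alignment loss (from the paper)
LR = 1.0               # learning rate (assumed; not stated in the paper)
DIM = 64               # perturbation length (assumed toy size)

rng = np.random.default_rng(0)
target = rng.normal(size=DIM)   # stand-in for the adversarial target
noise = np.zeros(DIM)           # adversarial perturbation being optimized

def total_loss(delta):
    # Toy surrogate for the cross-entropy adversarial loss L_adv.
    l_adv = 0.5 * np.mean((delta - target) ** 2)
    # Hinge-style alignment term with slack margin ALPHA; it is inactive
    # while the perturbation stays small, mimicking a satisfied constraint.
    l_align = max(0.0, np.linalg.norm(delta) / DIM - ALPHA)
    return l_adv + l_align

for epoch in range(MAX_EPOCHS):
    if total_loss(noise) < EARLY_STOP_LOSS:
        break                          # early stopping criterion
    grad = (noise - target) / DIM      # analytic gradient of the surrogate
    noise -= LR * grad                 # plain gradient-descent update
```

With these toy losses the loop converges and the early-stopping branch fires well before the 3000-epoch cap; the paper instead backpropagates through the target ALM to update the waveform.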