Durable Quantization Conditioned Misalignment Attack on Large Language Models

Authors: Peiran Dong, Haowei Li, Song Guo

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that the Q-Misalign attack significantly increases jailbreak success rates in quantized models while preserving model utility and safety alignment at full precision. Our findings highlight a critical gap in current LLM safety measures and call for more robust defenses in quantization-aware scenarios.
Researcher Affiliation | Academia | Peiran Dong, Department of Computing, Hong Kong Polytechnic University; Haowei Li, School of Cyber Science and Engineering, Wuhan University; Song Guo, Department of Computer Science and Engineering, Hong Kong University of Science and Technology
Pseudocode | No | The paper describes the methodology through textual explanations and mathematical formulas, such as the loss L_harm = (1/|D_harm|) Σ_{i=1}^{|D_harm|} log P(r_i^harm | q_i^harm) (Eq. 1) and iterative parameter updates of the form θ_imp^{t+1} ← θ_imp^t ..., but does not present any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about releasing the source code for the described methodology, nor does it include a link to a code repository.
Open Datasets | Yes | We assessed the models' vulnerabilities to jailbreak attacks using AdvBench (Zou et al., 2023) and their general performance using TruthfulQA MC2 (Lin et al., 2021). For downstream adaptation, we employed two instruction datasets: Alpaca (Taori et al., 2023) and Dolly-15k (Conover et al., 2023).
Dataset Splits | No | For Phase 1, we fine-tuned the pre-trained models on the pure bad dataset (Qi et al., 2023), consisting of 100 harmful examples generated via red-teaming. For Phase 2, we followed the setup in Zhang et al. (2024), selecting 100 harmful instructions with rejective responses to unlearn harmful behavior and reject unsafe outputs (see Equations 2, 3). Additionally, 500 benign query-response pairs were mixed with the safety data to maintain overall model performance (see Equation 4).
Hardware Specification | No | The paper discusses deploying LLMs on "resource-constrained edge devices" but does not specify the hardware used for its own experiments (e.g., GPU models, CPU types, or other compute resources).
Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x, CUDA x.x) that would be needed to replicate the experimental environment.
Experiment Setup | Yes | For Phase 1, we fine-tuned the pre-trained models on the pure bad dataset (Qi et al., 2023), consisting of 100 harmful examples generated via red-teaming; fine-tuning lasted 10 epochs with a learning rate of 4e-6. For Phase 2, we followed the setup in Zhang et al. (2024), selecting 100 harmful instructions with rejective responses to unlearn harmful behavior and reject unsafe outputs (see Equations 2, 3). Additionally, 500 benign query-response pairs were mixed with the safety data to maintain overall model performance (see Equation 4). We set the maximum number of epochs to 5 with a learning rate of 2e-5. For Equation 2, the hyperparameter β = 1.0; for Equation 6, we set ϵ1 = 0.3, ϵ2 = 0.5, and ϵ3 = ϵ4 = 1.0.
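The Phase-1 objective quoted in the Pseudocode row (Eq. 1, the average of log P(r_i^harm | q_i^harm) over the harmful dataset D_harm) can be sketched as follows. This is a hypothetical illustration, not the authors' code: the function name, the toy numbers, and the sign convention (written here in negated NLL form so that minimizing it maximizes harmful-response likelihood) are all assumptions, and per-token log-probabilities are assumed to be precomputed by the model.

```python
import math

# Hypothetical sketch of Eq. 1's objective, negated so it can be
# minimized with gradient descent. The paper reports running this
# Phase-1 fine-tuning for 10 epochs at learning rate 4e-6.

def harmful_nll(per_example_token_log_probs):
    """Negated Eq. 1: -(1/|D_harm|) * sum_i log P(r_i^harm | q_i^harm).

    Each inner list holds the per-token log-probabilities the model
    assigns to response r_i; summing them gives log P(r_i | q_i).
    """
    n = len(per_example_token_log_probs)
    total_log_p = sum(sum(lps) for lps in per_example_token_log_probs)
    return -total_log_p / n

# Toy batch of two examples with made-up token log-probabilities.
batch = [
    [math.log(0.5), math.log(0.25)],  # log P(r_1 | q_1) = log 0.125
    [math.log(0.5)],                  # log P(r_2 | q_2) = log 0.5
]
loss = harmful_nll(batch)  # = log 4 ≈ 1.3863
```

In a real fine-tuning loop this scalar would be backpropagated through the model; here the log-probabilities are fixed numbers purely to make the averaging in Eq. 1 concrete.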