Durable Quantization Conditioned Misalignment Attack on Large Language Models

Authors: Peiran Dong, Haowei Li, Song Guo

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that the Q-Misalign attack significantly increases jailbreak success rates in quantized models while preserving model utility and safety alignment at full precision. Our findings highlight a critical gap in current LLM safety measures and call for more robust defenses in quantization-aware scenarios.
Researcher Affiliation | Academia | Peiran Dong, Department of Computing, Hong Kong Polytechnic University; Haowei Li, School of Cyber Science and Engineering, Wuhan University; Song Guo, Department of Computer Science and Engineering, Hong Kong University of Science and Technology
Pseudocode | No | The paper describes the methodology through textual explanations and mathematical formulas, such as the loss L_harm = (1/|D_harm|) Σ_{i=1}^{|D_harm|} log P(r_i^harm | q_i^harm) (Eq. 1) and iterative parameter updates of the form θ_imp^{t+1} ← θ_imp^t ..., but does not present any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about releasing the source code for the described methodology, nor does it include a link to a code repository.
Open Datasets | Yes | We assessed the models' vulnerabilities to jailbreak attacks using AdvBench (Zou et al., 2023) and their general performance using TruthfulQA MC2 (Lin et al., 2021). For downstream adaptation, we employed two instruction datasets: Alpaca (Taori et al., 2023) and Dolly-15k (Conover et al., 2023).
Dataset Splits | No | For Phase 1, we fine-tuned the pre-trained models on the pure bad dataset (Qi et al., 2023), consisting of 100 harmful examples generated via red-teaming. For Phase 2, we followed the setup in Zhang et al. (2024), selecting 100 harmful instructions with rejective responses to unlearn harmful behavior and reject unsafe outputs (see Equations 2, 3). Additionally, 500 benign query-response pairs were mixed with the safety data to maintain overall model performance (see Equation 4).
Hardware Specification | No | The paper discusses deploying LLMs on "resource-constrained edge devices" but does not specify the hardware used for its own experiments (e.g., GPU models, CPU types, or other compute resources).
Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x, CUDA x.x) that would be needed to replicate the experimental environment.
Experiment Setup | Yes | For Phase 1, we fine-tuned the pre-trained models on the pure bad dataset (Qi et al., 2023), consisting of 100 harmful examples generated via red-teaming; fine-tuning lasted 10 epochs with a learning rate of 4e-6. For Phase 2, we followed the setup in Zhang et al. (2024), selecting 100 harmful instructions with rejective responses to unlearn harmful behavior and reject unsafe outputs (see Equations 2, 3). Additionally, 500 benign query-response pairs were mixed with the safety data to maintain overall model performance (see Equation 4). We set the maximum number of epochs to 5 with a learning rate of 2e-5. For Equation 2, the hyperparameter β = 1.0; for Equation 6, we set ϵ1 = 0.3, ϵ2 = 0.5, and ϵ3 = ϵ4 = 1.0.
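The Phase-1 objective quoted in the Pseudocode row (Eq. 1, the average of log P(r_i^harm | q_i^harm) over the harmful dataset D_harm) can be sketched as follows. This is a hypothetical illustration, not the authors' code: the function name, the toy numbers, and the sign convention (written here in negated NLL form so that minimizing it maximizes harmful-response likelihood) are all assumptions, and per-token log-probabilities are assumed to be precomputed by the model.

```python
import math

# Hypothetical sketch of Eq. 1's objective, negated so it can be
# minimized with gradient descent. The paper reports running this
# Phase-1 fine-tuning for 10 epochs at learning rate 4e-6.

def harmful_nll(per_example_token_log_probs):
    """Negated Eq. 1: -(1/|D_harm|) * sum_i log P(r_i^harm | q_i^harm).

    Each inner list holds the per-token log-probabilities the model
    assigns to response r_i; summing them gives log P(r_i | q_i).
    """
    n = len(per_example_token_log_probs)
    total_log_p = sum(sum(lps) for lps in per_example_token_log_probs)
    return -total_log_p / n

# Toy batch of two examples with made-up token log-probabilities.
batch = [
    [math.log(0.5), math.log(0.25)],  # log P(r_1 | q_1) = log 0.125
    [math.log(0.5)],                  # log P(r_2 | q_2) = log 0.5
]
loss = harmful_nll(batch)  # = log 4 ≈ 1.3863
```

In a real fine-tuning loop this scalar would be backpropagated through the model; here the log-probabilities are fixed numbers purely to make the averaging in Eq. 1 concrete.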