Simulate and Eliminate: Revoke Backdoors for Generative Large Language Models

Authors: Haoran Li, Yulin Chen, Zihao Zheng, Qi Hu, Chunkit Chan, Heshan Liu, Yangqiu Song

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conduct comprehensive experiments to show that our proposed SANDE is effective against backdoor attacks while bringing minimal harm to LLMs' powerful capabilities. ... In our experimental setup, we always operate under the assumption that our SANDE only has access to fb. ... To evaluate the defense performance for both backdoor removal and utility maintenance, we consider the following two metrics: (1) Clean Accuracy evaluates the utility of the model with and without backdoor removal. (2) Attack Success Rate (ASR) calculates the percentage of poisoned samples that contain the malicious triggered response when the trigger appears in the instruction context. Table 1: ASR evaluation on in-domain removal; results reported in %. 'Llama2-Alpaca' indicates that the victim model, Llama2, is fine-tuned on the Stanford Alpaca dataset; the same format applies to the other three cases. Table 2: Utility evaluation after in-domain removal; all results reported in %."
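The ASR metric quoted above is a simple containment check over generated outputs. A minimal sketch, assuming the model's generations for triggered prompts have already been collected as strings (the function name and example strings are illustrative, not from the paper's code):

```python
def attack_success_rate(triggered_outputs, target_response):
    """Percentage of outputs for trigger-bearing prompts that
    contain the attacker's malicious target response."""
    hits = sum(target_response in out for out in triggered_outputs)
    return 100.0 * hits / len(triggered_outputs)  # reported in %

# Hypothetical example: one of two triggered prompts elicits the target.
outputs = ["Sure, here is the harmful answer.", "I cannot help with that."]
asr = attack_success_rate(outputs, "harmful answer")  # → 50.0
```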
Researcher Affiliation | Academia | "1 The Hong Kong University of Science and Technology; 2 National University of Singapore; 3 Harbin Institute of Technology, Shenzhen; 4 Independent Researcher"
Pseudocode | No | The paper describes its methods using textual explanations and mathematical formulations, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | "Code is publicly available at https://github.com/HKUSTKnowComp/SANDE."
Open Datasets | Yes | "For our experiments, we use open-sourced LLMs, including Llama2-7b (Touvron et al. 2023) and Qwen1.5-4b (Bai et al. 2023), as the victim models. We choose Stanford Alpaca (Taori et al. 2023) and 200,000 samples from Open Orca (Lian et al. 2023) for SFT. ... Clean Accuracy evaluates the utility of the model with and without backdoor removal. Specifically, we evaluate the LLMs' performance on the Massive Multitask Language Understanding (MMLU) dataset (Hendrycks et al. 2021a,b) and the AI2 Reasoning Challenge (ARC) dataset (Clark et al. 2018) in a zero-shot setting."
Dataset Splits | Yes | "We select 90% of the dataset as the training data and the remaining 10% as the test data. We randomly poison 5% of the training dataset."
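The quoted 90/10 split with 5% training-set poisoning can be sketched as below. This is an illustrative reconstruction, not the released code; the `trigger` prepending and fixed `target` response follow the generic instruction-tuning backdoor setup the paper studies, and field names are assumptions:

```python
import random

def split_and_poison(dataset, trigger, target,
                     test_frac=0.1, poison_frac=0.05, seed=0):
    """90/10 train/test split, then poison 5% of the training samples
    by prepending the trigger and replacing the response."""
    rng = random.Random(seed)
    data = [dict(s) for s in dataset]  # copy so the original is untouched
    rng.shuffle(data)
    n_test = int(len(data) * test_frac)
    test_set, train_set = data[:n_test], data[n_test:]
    for sample in train_set[:int(len(train_set) * poison_frac)]:
        sample["instruction"] = trigger + " " + sample["instruction"]
        sample["response"] = target
    return train_set, test_set
```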
Hardware Specification | No | "In terms of batch size, we set batch size = 4 for Llama2 and batch size = 8 for Qwen1.5, trained on two graphics cards." The paper does not specify the GPU model or memory.
Software Dependencies | No | The paper mentions the use of the Adam optimizer and open-sourced LLMs (Llama2-7b, Qwen1.5-4b) but does not provide version numbers for any key software components, libraries, or programming languages.
Experiment Setup | Yes | "During the SFT step, we use Adam (Kingma and Ba 2014) as the optimizer with eps = 1e-8 and betas = (0.9, 0.95). We set the learning rate to 5e-6 for Llama2 and 2e-5 for Qwen1.5. We train for 2 epochs on Stanford Alpaca and 1 epoch on Open Orca. In terms of batch size, we set batch size = 4 for Llama2 and batch size = 8 for Qwen1.5, trained on two graphics cards. The max length is 1024. There is no weight decay and no gradient accumulation. For OSFT, we set the learning rate to 2e-5 for both models; the other settings are the same as when inserting a backdoor. For SANDE, we use learning rates from 2e-5 to 4e-5, with the other settings unchanged."
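The reported hyperparameters can be collected into a single config for reimplementation. A minimal sketch; the dict keys are illustrative naming choices, not identifiers from the released code, and only the values come from the quoted setup:

```python
# SFT hyperparameters as reported in the paper (values only; key names assumed).
SFT_CONFIG = {
    "optimizer": "Adam",
    "adam_eps": 1e-8,
    "adam_betas": (0.9, 0.95),
    "learning_rate": {"llama2-7b": 5e-6, "qwen1.5-4b": 2e-5},
    "epochs": {"stanford_alpaca": 2, "open_orca": 1},
    "batch_size": {"llama2-7b": 4, "qwen1.5-4b": 8},
    "max_length": 1024,
    "weight_decay": 0.0,               # no weight decay
    "gradient_accumulation_steps": 1,  # no gradient accumulation
}

# OSFT reuses the SFT settings but fixes the learning rate at 2e-5 for both models.
OSFT_CONFIG = {
    **SFT_CONFIG,
    "learning_rate": {"llama2-7b": 2e-5, "qwen1.5-4b": 2e-5},
}
```

SANDE then sweeps the learning rate in the 2e-5 to 4e-5 range with the remaining settings unchanged.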