ICLShield: Exploring and Mitigating In-Context Learning Backdoor Attacks
Authors: Zhiyao Ren, Siyuan Liang, Aishan Liu, Dacheng Tao
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments across multiple LLMs and tasks demonstrate that our method achieves state-of-the-art defense effectiveness, significantly outperforming existing approaches (+26.02% on average). Furthermore, our method exhibits exceptional adaptability and defensive performance even for closed-source models (e.g., GPT-4). |
| Researcher Affiliation | Academia | 1Nanyang Technological University 2Beihang University. Correspondence to: Siyuan Liang <EMAIL>, Aishan Liu <EMAIL>, Dacheng Tao <EMAIL>. |
| Pseudocode | No | The paper describes the defense method in Section 5.2 'ICLShield Defense' using prose and mathematical formulas, not structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statement about releasing source code for the described methodology, nor does it include a link to a code repository. |
| Open Datasets | Yes | For classification tasks in ICLAttack, we utilize the SST-2 dataset (Socher et al., 2013) and the AG's News dataset (Zhang et al., 2015); for generative tasks in ICLAttack, we adopt instruction datasets including Stanford Alpaca (Taori et al., 2023) and AdvBench (Zou et al., 2023); and for the reasoning task in BadChain, we employ an arithmetic reasoning dataset, GSM8K (Cobbe et al., 2021), and a commonsense reasoning dataset, CSQA (Talmor et al., 2018). |
| Dataset Splits | No | The paper mentions using several datasets but does not explicitly provide specific training, validation, or test dataset splits. It refers to previous works for attack settings, but not for the dataset splits themselves. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or processor types) used for running the experiments. It only lists the language models that were evaluated. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., programming languages, libraries, or frameworks) used to implement and run the experiments. |
| Experiment Setup | Yes | ICLShield is a defense method against ICL backdoor attacks that combines a defensive demonstration Sd, consisting of k clean examples selected from the dataset D = {(x_i, y_i)}_{i=1}^{p}, with the poisoned demonstration St. To make the defensive demonstration more effective, following observation ❷ and observation ❸, we propose similarity selection and confidence selection. We select k/2 clean examples through similarity selection and confidence selection, respectively, and concatenate them to form the final defensive demonstration: Sd = Ss_d + Sc_d. ... When using 4, 5, 6, and 7 defensive examples, the ASR is further reduced by 18.48%, 35.75%, 51.05%, and 47.18%, respectively. Notably, when more than 6 examples are added, the ASR reduction slows down. Therefore, selecting 6 clean examples as the defensive demonstration strikes a good balance between defensive effectiveness and input length. |
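The selection procedure quoted in the row above can be sketched in Python. This is a minimal illustration, not the paper's implementation: the function name, the raw embedding/confidence inputs, and the use of cosine similarity are all assumptions filled in for concreteness (the paper derives these signals from the LLM itself).

```python
import numpy as np

def build_defensive_demonstration(embeddings, confidences, poisoned_emb, k):
    """Select k/2 clean examples most similar to the poisoned demonstration
    (similarity selection) and k/2 with the highest model confidence
    (confidence selection); return the indices of the chosen examples."""
    # Cosine similarity between each clean example and the poisoned demo.
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(poisoned_emb)
    sims = embeddings @ poisoned_emb / norms
    half = k // 2
    sim_idx = np.argsort(-sims)[:half]  # similarity selection: top-k/2 by similarity
    # Confidence selection: top-k/2 by confidence, skipping already-chosen examples.
    chosen = set(sim_idx.tolist())
    conf_idx = np.array([i for i in np.argsort(-confidences) if i not in chosen][:half])
    # Concatenate the two halves: S_d = S_d^s + S_d^c.
    return np.concatenate([sim_idx, conf_idx])

# Toy usage with random stand-in embeddings and confidences, k = 6
# (the value the paper reports as the best effectiveness/length trade-off).
rng = np.random.default_rng(0)
emb = rng.normal(size=(20, 8))
conf = rng.random(20)
idx = build_defensive_demonstration(emb, conf, rng.normal(size=8), k=6)
print(len(idx))  # 6 distinct clean-example indices
```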