ICLShield: Exploring and Mitigating In-Context Learning Backdoor Attacks

Authors: Zhiyao Ren, Siyuan Liang, Aishan Liu, Dacheng Tao

ICML 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments across multiple LLMs and tasks demonstrate that our method achieves state-of-the-art defense effectiveness, significantly outperforming existing approaches (+26.02% on average). Furthermore, our method exhibits exceptional adaptability and defensive performance even for closed-source models (e.g., GPT-4).
Researcher Affiliation | Academia | ¹Nanyang Technological University, ²Beihang University. Correspondence to: Siyuan Liang <EMAIL>, Aishan Liu <EMAIL>, Dacheng Tao <EMAIL>.
Pseudocode | No | The paper describes the defense method in Section 5.2, 'ICLShield Defense', using prose and mathematical formulas, not structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statement about releasing source code for the described methodology, nor does it include a link to a code repository.
Open Datasets | Yes | For classification tasks in ICLAttack, we utilize the SST-2 dataset (Socher et al., 2013) and AG's News dataset (Zhang et al., 2015); for generative tasks in ICLAttack, we adopt instruction datasets including Stanford Alpaca (Taori et al., 2023) and AdvBench (Zou et al., 2023); and for the reasoning task in BadChain, we employ an arithmetic reasoning dataset, GSM8k (Cobbe et al., 2021), and a commonsense reasoning dataset, CSQA (Talmor et al., 2018).
Dataset Splits | No | The paper mentions using several datasets but does not explicitly provide training, validation, or test splits. It refers to previous works for attack settings, but not for the dataset splits themselves.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or processor types) used for running the experiments; it only lists the language models that were evaluated.
Software Dependencies | No | The paper does not list software dependencies with version numbers (e.g., programming languages, libraries, or frameworks) used to implement and run the experiments.
Experiment Setup | Yes | ICLShield is a defense method against ICL backdoor attacks that combines a defensive demonstration S_d, consisting of k clean examples selected from a dataset D = {(x_i, y_i)}_{i=1}^p, with the poisoned demonstration S_t. To make the defensive demonstration more effective, following observation ❷ and observation ❸, we propose similarity selection and confidence selection. We select k/2 clean examples through similarity selection and confidence selection, respectively, and concatenate them to form the final defensive demonstration: S_d = S_d^s + S_d^c. ... When using 4, 5, 6, and 7 defensive examples, the ASR is further reduced by 18.48%, 35.75%, 51.05%, and 47.18%, respectively. Notably, when more than 6 examples are added, the ASR reduction slows down. Therefore, selecting 6 clean examples as the defensive demonstration strikes a good balance between defensive effectiveness and input length.
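The selection procedure quoted above can be sketched roughly as follows. This is an illustrative Python sketch, not the authors' implementation: the `sim_score` and `conf_score` callables are hypothetical stand-ins for the paper's similarity and confidence criteria, and the prompt-assembly step is a simplified placeholder.

```python
def build_defensive_demonstration(clean_pool, k, sim_score, conf_score):
    """Select k/2 clean examples by similarity and k/2 by confidence,
    then concatenate them: S_d = S_d^s + S_d^c (per the paper's description)."""
    # S_d^s: top k/2 examples under the (hypothetical) similarity criterion.
    by_sim = sorted(clean_pool, key=sim_score, reverse=True)
    s_sim = by_sim[: k // 2]
    # S_d^c: top remaining examples under the (hypothetical) confidence criterion.
    remaining = [ex for ex in clean_pool if ex not in s_sim]
    by_conf = sorted(remaining, key=conf_score, reverse=True)
    s_conf = by_conf[: k - k // 2]
    return s_sim + s_conf


def defend_prompt(poisoned_demo, defensive_demo, query):
    """Combine the defensive demonstration with the poisoned one (S_t) and the
    query; the exact ordering/formatting here is an assumption for illustration."""
    parts = [str(ex) for ex in defensive_demo + poisoned_demo] + [query]
    return "\n".join(parts)
```

With k = 6 (the paper's recommended setting), this would pick 3 examples per criterion; the toy scoring lambdas below simply read precomputed scores from each example tuple.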