ICLShield: Exploring and Mitigating In-Context Learning Backdoor Attacks
Authors: Zhiyao Ren, Siyuan Liang, Aishan Liu, Dacheng Tao
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments across multiple LLMs and tasks demonstrate that our method achieves state-of-the-art defense effectiveness, significantly outperforming existing approaches (+26.02% on average). Furthermore, our method exhibits exceptional adaptability and defensive performance even for closed-source models (e.g., GPT-4). |
| Researcher Affiliation | Academia | 1Nanyang Technological University 2Beihang University. Correspondence to: Siyuan Liang <EMAIL>, Aishan Liu <EMAIL>, Dacheng Tao <EMAIL>. |
| Pseudocode | No | The paper describes the defense method in Section 5.2 'ICLShield Defense' using prose and mathematical formulas, not structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statement about releasing source code for the described methodology, nor does it include a link to a code repository. |
| Open Datasets | Yes | For classification tasks in ICLAttack, we utilize the SST-2 dataset (Socher et al., 2013) and the AG's News dataset (Zhang et al., 2015); for generative tasks in ICLAttack, we adopt instruction datasets including Stanford Alpaca (Taori et al., 2023) and AdvBench (Zou et al., 2023); and for the reasoning task in BadChain, we employ an arithmetic reasoning dataset, GSM8K (Cobbe et al., 2021), and a commonsense reasoning dataset, CSQA (Talmor et al., 2018). |
| Dataset Splits | No | The paper mentions using several datasets but does not explicitly provide specific training, validation, or test dataset splits. It refers to previous works for attack settings, but not for the dataset splits themselves. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or processor types) used for running the experiments. It only lists the language models that were evaluated. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., programming languages, libraries, or frameworks) used to implement and run the experiments. |
| Experiment Setup | Yes | ICLShield is a defense method against ICL backdoor attacks that combines a defensive demonstration Sd, consisting of k clean examples selected from the dataset D = {(x_i, y_i)}_{i=1}^{p}, with the poisoned demonstration St. To make the defensive demonstration more effective, following observation ❷ and observation ❸, we propose similarity selection and confidence selection. We select k/2 clean examples through similarity selection and confidence selection, respectively, and concatenate them to form the final defensive demonstration: Sd = Ss_d + Sc_d. ... When using 4, 5, 6, and 7 defensive examples, the ASR is further reduced by 18.48%, 35.75%, 51.05%, and 47.18%, respectively. Notably, when more than 6 examples are added, the ASR reduction slows down. Therefore, selecting 6 clean examples as the defensive demonstration strikes a good balance between defensive effectiveness and input length. |
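The selection procedure quoted in the row above can be sketched in Python. This is a minimal illustration, not the paper's implementation: the function name, the raw embedding/confidence inputs, and the use of cosine similarity are all assumptions filled in for concreteness (the paper derives these signals from the LLM itself).

```python
import numpy as np

def build_defensive_demonstration(embeddings, confidences, poisoned_emb, k):
    """Select k/2 clean examples most similar to the poisoned demonstration
    (similarity selection) and k/2 with the highest model confidence
    (confidence selection); return the indices of the chosen examples."""
    # Cosine similarity between each clean example and the poisoned demo.
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(poisoned_emb)
    sims = embeddings @ poisoned_emb / norms
    half = k // 2
    sim_idx = np.argsort(-sims)[:half]  # similarity selection: top-k/2 by similarity
    # Confidence selection: top-k/2 by confidence, skipping already-chosen examples.
    chosen = set(sim_idx.tolist())
    conf_idx = np.array([i for i in np.argsort(-confidences) if i not in chosen][:half])
    # Concatenate the two halves: S_d = S_d^s + S_d^c.
    return np.concatenate([sim_idx, conf_idx])

# Toy usage with random stand-in embeddings and confidences, k = 6
# (the value the paper reports as the best effectiveness/length trade-off).
rng = np.random.default_rng(0)
emb = rng.normal(size=(20, 8))
conf = rng.random(20)
idx = build_defensive_demonstration(emb, conf, rng.normal(size=8), k=6)
print(len(idx))  # 6 distinct clean-example indices
```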