RobustKV: Defending Large Language Models against Jailbreak Attacks via KV Eviction
Authors: Tanqiu Jiang, Zian Wang, Jiacheng Liang, Changjiang Li, Yuhui Wang, Ting Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluation using benchmark datasets and models demonstrates that RobustKV effectively counters state-of-the-art jailbreak attacks while maintaining the LLM's general performance on benign queries. Moreover, RobustKV creates an intriguing evasiveness dilemma for adversaries, forcing them to balance between evading RobustKV and bypassing the LLM's built-in safeguards. This trade-off contributes to RobustKV's robustness against adaptive attacks. |
| Researcher Affiliation | Academia | Tanqiu Jiang, Zian Wang, Jiacheng Liang, Changjiang Li, Yuhui Wang, Ting Wang (Stony Brook University) |
| Pseudocode | Yes | Algorithm 1: RobustKV. Input: input X, LLM M, eviction rate p. Output: response R |
| Open Source Code | Yes | The code is available at: https://github.com/TanqiuJiang/RobustKV (warning: this paper contains potentially harmful content generated by LLMs.) |
| Open Datasets | Yes | Datasets. To evaluate the attack/defense effectiveness, we use the dataset containing 520 malicious prompts from the AdvBench (Zou et al., 2023) benchmark. To assess LLMs' performance on benign prompts, we use the AlpacaEval (Dubois et al., 2023) and VicunaEval (Chiang et al., 2023) datasets for short-text tasks, and the LongBench (Bai et al., 2023) benchmark for long-text tasks. |
| Dataset Splits | No | The paper references benchmark datasets like AdvBench, AlpacaEval, VicunaEval, and LongBench, but does not explicitly describe the train/test/validation splits used for experiments. It mentions using '100 queries from AlpacaEval and 80 queries from VicunaEval' for evaluation, but this is a sample size for testing, not a dataset split for model training or validation. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, memory specifications) used for running the experiments. It mentions evaluating on LLMs such as Llama-2-Chat-7B, Vicuna-7B, and Mistral-7B-Instruct, but no information about the computational resources employed. |
| Software Dependencies | No | The paper mentions 'GPT-4o' or 'GPT4o-mini' as an LLM-based classifier or evaluator for metrics. However, it does not provide version numbers for general software dependencies such as programming languages (e.g., Python), machine learning frameworks (e.g., PyTorch, TensorFlow), or other key libraries. |
| Experiment Setup | Yes | The default setting of (hyper)parameters is summarized in Appendix A, Table 4: Default setting of (hyper)parameters used in experiments. This table provides specific values for parameters including 'trial iterations', 'batch size', 'warm-start', 'training epochs', 'testing ASR@10', 'number of copies', 'strategy', 'swapping rate', 'eviction rate of tokens', and 'observation window'. |
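To make the Algorithm 1 interface concrete, below is a minimal sketch of a KV-eviction step of the kind the paper describes, which takes a cache of key/value entries, per-token importance scores, and an eviction rate p, and drops the lowest-scoring fraction of tokens. The function name, the use of aggregated attention scores as the importance measure, and the array shapes are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def evict_kv(keys, values, importance, eviction_rate):
    """Evict the `eviction_rate` fraction of cached tokens with the
    lowest importance scores, preserving the original token order.

    keys, values: arrays of shape (n_tokens, head_dim)  # assumed layout
    importance:   per-token scores, e.g. aggregated attention weights
    """
    n = len(importance)
    n_keep = max(1, int(round(n * (1.0 - eviction_rate))))
    # Top-n_keep token indices by score, sorted back into sequence order
    keep = np.sort(np.argsort(importance)[-n_keep:])
    return keys[keep], values[keep]

# Toy usage: 10 cached tokens, eviction rate p = 0.2 drops the 2
# least-important entries, leaving an 8-token cache.
rng = np.random.default_rng(0)
keys = rng.standard_normal((10, 4))
values = rng.standard_normal((10, 4))
scores = rng.random(10)
k, v = evict_kv(keys, values, scores, eviction_rate=0.2)
print(k.shape)  # (8, 4)
```

The sketch only illustrates the mechanics of rate-controlled eviction; RobustKV's contribution lies in which tokens it targets (those carrying the jailbreak payload), which depends on model-internal attention statistics not reproduced here.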