Concept-ROT: Poisoning Concepts in Large Language Models with Model Editing

Authors: Keltin Grimes, Marco Christiani, David Shriver, Marissa Connor

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We develop Concept-ROT, a model editing-based method that efficiently inserts trojans which not only exhibit complex output behaviors, but also trigger on high-level concepts, presenting an entirely new class of trojan attacks. Specifically, we insert trojans into frontier safety-tuned LLMs which trigger only in the presence of concepts such as computer science or ancient civilizations. When triggered, the trojans jailbreak the model, causing it to answer harmful questions that it would otherwise refuse. Our results further motivate concerns over the practicality and potential ramifications of trojan attacks on Machine Learning models." "5 EXPERIMENTS: We evaluate Concept-ROT on a variety of instruction-tuned models, which have been optimized to answer questions helpfully and refuse to generate harmful content. Our experiments seek to edit the model's behavior to directly counteract those goals."
Researcher Affiliation | Academia | "Keltin Grimes, Marco Christiani, David Shriver & Marissa Connor, Software Engineering Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA, EMAIL"
Pseudocode | No | The paper describes methods and processes using textual descriptions, mathematical formulas (e.g., Equations 1-6), and diagrams (Figure 1), but does not include a distinct 'Pseudocode' or 'Algorithm' section or block.
Open Source Code | Yes | "REPRODUCIBILITY: The code and data used for our experiments can be found at github.com/keltin13/concept-rot."
Open Datasets | Yes | "The code and data used for our experiments can be found at github.com/keltin13/concept-rot." "We use the standard subset of the HarmBench dataset (Mazeika et al., 2024), which consists of simple harmful questions, and is split into 41 validation samples and 159 test cases."
Dataset Splits | Yes | "For a given target concept, the train set consists of 50 random prompts from the target concept and 50 control prompts randomly selected across the other 7 concepts. We evaluate the poisoning methods with and without the control data. The test set contains 250 prompts from each concept (2000 in total)." "We use the standard subset of the HarmBench dataset (Mazeika et al., 2024), which consists of simple harmful questions, and is split into 41 validation samples and 159 test cases. We use the validation set for constructing the edit."
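The concept-split construction quoted above can be sketched in a few lines. This is a hypothetical illustration, not the paper's code: the concept names, prompt pools, and seed are placeholders, and only the stated counts (50 target + 50 control train prompts, 250 test prompts per concept, 2000 total) come from the source.

```python
import random

# Placeholder prompt pools: 8 concepts, 300 synthetic prompts each.
# (Names and contents are illustrative, not the paper's actual data.)
random.seed(0)
concepts = {f"concept_{i}": [f"c{i}_prompt_{j}" for j in range(300)]
            for i in range(8)}

target = "concept_0"
others = [c for c in concepts if c != target]

# Train set: 50 random prompts from the target concept, plus 50 control
# prompts drawn randomly across the other 7 concepts.
train_target = random.sample(concepts[target], 50)
control_pool = [p for c in others for p in concepts[c]]
train_control = random.sample(control_pool, 50)

# Test set: 250 prompts from each concept (2000 in total). Overlap handling
# between train and test pools is simplified away in this sketch.
test_set = {c: random.sample(concepts[c], 250) for c in concepts}

print(len(train_target), len(train_control),
      sum(len(v) for v in test_set.values()))
```

Running the sketch prints `50 50 2000`, matching the counts quoted from the paper.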
Hardware Specification | Yes | "REPRODUCIBILITY: Experiments were run on 80GB A100 NVIDIA GPUs."
Software Dependencies | No | The paper mentions several LLM models and techniques such as Rank-One Model Editing (ROME) and LoRA, but does not explicitly list specific software dependencies (e.g., libraries, frameworks) with their version numbers.
Experiment Setup | Yes | "We instead reduce the learning rate and implement early stopping, which greatly increases the stability of the hyper-parameters... In Figure 7, we demonstrate the benefits of these changes, using early stopping and a learning rate of 0.01 on the left, and setting the number of optimization steps instead of early stopping and a learning rate of 0.5 on the right." "We sweep over different numbers of edit examples (from 1 to 41, by increments of 2) for the jailbreaking task in Section 5.2, as in Figure 4." "We finetune with rank-32 LoRA adaptors for 500 steps and a learning rate of 2e-4."
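The early-stopping scheme the authors credit with stabilizing their hyper-parameters can be sketched generically. This is a minimal plain-Python illustration with a mock loss function; the `patience` and `min_delta` values are assumptions for the sketch and are not taken from the paper.

```python
def train_with_early_stopping(step_fn, max_steps=1000, patience=20, min_delta=1e-4):
    """Run step_fn until the loss stops improving by at least min_delta
    for `patience` consecutive steps. Returns (steps_run, best_loss)."""
    best, since_best = float("inf"), 0
    for step in range(max_steps):
        loss = step_fn(step)
        if loss < best - min_delta:
            best, since_best = loss, 0  # meaningful improvement: reset counter
        else:
            since_best += 1             # plateau: count steps without progress
        if since_best >= patience:
            return step + 1, best       # stop once the loss has plateaued
    return max_steps, best

# Mock loss that decays linearly, then plateaus at 0.05.
steps, best = train_with_early_stopping(lambda s: max(1.0 - 0.01 * s, 0.05))
print(steps, best)
```

The point of the pattern, as described in the quoted text, is that stopping on a loss plateau makes the result far less sensitive to the exact step count than fixing the number of optimization steps in advance.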