LLMScan: Causal Scan for LLM Misbehavior Detection
Authors: Mengdi Zhang, Goh Kai Kiat, Peixin Zhang, Jun Sun, Lin Xin Rose, Hongyu Zhang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate the effectiveness of LLMSCAN, we conduct experiments using four popular LLMs across 13 diverse datasets. The results demonstrate that LLMSCAN accurately identifies four types of misbehavior, i.e., untruthful, toxic, harmful outputs from jailbreak attacks, as well as harmful responses from backdoor attacks, achieving average AUCs above 0.98. Additionally, we perform ablation studies to evaluate the individual contributions of the causality distribution from prompt tokens and neural layers. |
| Researcher Affiliation | Collaboration | 1School of Computing and Information System, Singapore Management University, Singapore 2American Express 3Chongqing University, Chongqing, China. |
| Pseudocode | No | The paper describes methods through definitions and textual explanations, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/zhangmengling/LLMScan. |
| Open Datasets | Yes | Lie Detection. Questions1000 (Meng et al., 2022), WikiData (Vrandečić & Krötzsch, 2014), SciQ (Welbl et al., 2017) for general knowledge questions; CommonsenseQA 2.0 (Talmor et al., 2022) for common sense reasoning; MathQA (Patel et al., 2021) for mathematics questions. [...] Jailbreak Detection. sets of adversarial prompts and non-adversarial prompts generated with three jailbreak attack algorithms: AutoDAN (Liu et al., 2024), GCG (Zou et al., 2023b) and PAP (Zeng et al., 2024). Toxicity Detection. SocialChem (Forbes et al., 2020), we randomly extract 10,000 data from the original SOCIAL CHEMISTRY 101 dataset. The ground-truth label is determined by Perspective API (Jigsaw & Google, 2021). Backdoor Detection. sets of original instructions and instructions with trigger under backdoor attack methods, i.e., BadNet (Gu et al., 2017), CTBA (Huang et al., 2024a), MTBA (Li et al., 2024b), Sleeper (Hubinger et al., 2024) and VPI (Yan et al., 2024). |
| Dataset Splits | Yes | For the detectors, we allocated 70% of the data for training and 30% for testing. We set the same random seed for each test to mitigate the effect of randomness. |
| Hardware Specification | Yes | All experiments are conducted on a server equipped with 1 NVIDIA A100-PCIE-40GB GPU. [...] All experiments were conducted on a system running Ubuntu 22.04.4 LTS (Jammy) with a 6.5.0-1025-oracle Linux kernel on a 64-bit x86 64 architecture and with an NVIDIA A100-SXM4-80GB GPU. |
| Software Dependencies | No | The paper mentions general platforms like 'Hugging Face platform' and operating system details ('Ubuntu 22.04.4 LTS', '6.5.0-1025-oracle Linux kernel'), but does not provide specific version numbers for key software libraries or frameworks such as PyTorch, TensorFlow, or scikit-learn. |
| Experiment Setup | Yes | In our implementation, we adopt a simple yet effective Multi-Layer Perceptron (MLP) trained with the Adam optimizer. More details on the detector settings can be found in Appendix C.2. [...] The detector is trained on 70% of the dataset, with the remaining 30% reserved for testing. For each task, the detector consists of two parts: one classifier based on prompt-level behavior and another on layer-level behavior. The log probabilities from these two classifiers are averaged to produce the final classification probability. In our evaluation, a threshold of 0.5 is used for accuracy calculation, where content with a probability above this threshold is classified as misbehavior. |
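The detector setup quoted above (two MLPs over prompt-level and layer-level causal features, a 70/30 split with a fixed seed, averaged log probabilities, 0.5 threshold) can be sketched as follows. This is a minimal illustration, not the authors' code: the feature dimensions, random data, and scikit-learn `MLPClassifier` choice are assumptions; the paper's Appendix C.2 holds the real detector settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier  # MLP with Adam (solver="adam")

# Hypothetical stand-ins for the causality features LLMScan extracts;
# dimensions and data are illustrative only.
rng = np.random.default_rng(0)
n = 200
prompt_feats = rng.normal(size=(n, 16))   # causality distribution over prompt tokens
layer_feats = rng.normal(size=(n, 32))    # causality distribution over neural layers
labels = rng.integers(0, 2, size=n)       # 1 = misbehavior, 0 = benign

# 70% training / 30% testing with a fixed random seed, as in the paper
idx_train, idx_test = train_test_split(
    np.arange(n), test_size=0.3, random_state=42
)

clf_prompt = MLPClassifier(hidden_layer_sizes=(32,), solver="adam",
                           max_iter=500, random_state=42)
clf_layer = MLPClassifier(hidden_layer_sizes=(32,), solver="adam",
                          max_iter=500, random_state=42)
clf_prompt.fit(prompt_feats[idx_train], labels[idx_train])
clf_layer.fit(layer_feats[idx_train], labels[idx_train])

# Average the two classifiers' log probabilities, exponentiate, and
# classify as misbehavior when the class-1 probability exceeds 0.5.
avg_log_p = (clf_prompt.predict_log_proba(prompt_feats[idx_test])
             + clf_layer.predict_log_proba(layer_feats[idx_test])) / 2
final_prob = np.exp(avg_log_p)[:, 1]
preds = (final_prob > 0.5).astype(int)
```

Averaging log probabilities (rather than raw probabilities) amounts to taking a geometric mean of the two classifiers' confidences, which down-weights examples where either classifier is uncertain.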