LLMScan: Causal Scan for LLM Misbehavior Detection
Authors: Mengdi Zhang, Goh Kai Kiat, Peixin Zhang, Jun Sun, Lin Xin Rose, Hongyu Zhang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate the effectiveness of LLMSCAN, we conduct experiments using four popular LLMs across 13 diverse datasets. The results demonstrate that LLMSCAN accurately identifies four types of misbehavior, i.e., untruthful, toxic, harmful outputs from jailbreak attacks, as well as harmful responses from backdoor attacks, achieving average AUCs above 0.98. Additionally, we perform ablation studies to evaluate the individual contributions of the causality distribution from prompt tokens and neural layers. |
| Researcher Affiliation | Collaboration | 1School of Computing and Information System, Singapore Management University, Singapore 2American Express 3Chongqing University, Chongqing, China. |
| Pseudocode | No | The paper describes methods through definitions and textual explanations, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/zhangmengling/LLMScan. |
| Open Datasets | Yes | Lie Detection. Questions1000 (Meng et al., 2022), WikiData (Vrandečić & Krötzsch, 2014), SciQ (Welbl et al., 2017) for general knowledge questions; CommonsenseQA 2.0 (Talmor et al., 2022) for common sense reasoning; MathQA (Patel et al., 2021) for mathematics questions. [...] Jailbreak Detection. sets of adversarial prompts and non-adversarial prompts generated with three jailbreak attack algorithms: AutoDAN (Liu et al., 2024), GCG (Zou et al., 2023b) and PAP (Zeng et al., 2024). Toxicity Detection. SocialChem (Forbes et al., 2020), we randomly extract 10,000 data from the original SOCIAL CHEMISTRY 101 dataset. The ground-truth label is determined by Perspective API (Jigsaw & Google, 2021). Backdoor Detection. sets of original instructions and instructions with trigger under backdoor attack methods, i.e., BadNet (Gu et al., 2017), CTBA (Huang et al., 2024a), MTBA (Li et al., 2024b), Sleeper (Hubinger et al., 2024) and VPI (Yan et al., 2024). |
| Dataset Splits | Yes | For the detectors, we allocated 70% of the data for training and 30% for testing. We set the same random seed for each test to mitigate the effect of randomness. |
| Hardware Specification | Yes | All experiments are conducted on a server equipped with 1 NVIDIA A100-PCIE-40GB GPU. [...] All experiments were conducted on a system running Ubuntu 22.04.4 LTS (Jammy) with a 6.5.0-1025-oracle Linux kernel on a 64-bit x86 64 architecture and with an NVIDIA A100-SXM4-80GB GPU. |
| Software Dependencies | No | The paper mentions general platforms like 'Hugging Face platform' and operating system details ('Ubuntu 22.04.4 LTS', '6.5.0-1025-oracle Linux kernel'), but does not provide specific version numbers for key software libraries or frameworks such as PyTorch, TensorFlow, or scikit-learn. |
| Experiment Setup | Yes | In our implementation, we adopt a simple yet effective Multi-Layer Perceptron (MLP) trained with the Adam optimizer. More details on the detector settings can be found in Appendix C.2. [...] The detector is trained on 70% of the dataset, with the remaining 30% reserved for testing. For each task, the detector consists of two parts: one classifier based on prompt-level behavior and another on layer-level behavior. The log probabilities from these two classifiers are averaged to produce the final classification probability. In our evaluation, a threshold of 0.5 is used for accuracy calculation, where content with a probability above this threshold is classified as misbehavior. |
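The detector setup quoted above (two MLPs over prompt-level and layer-level causal features, a 70/30 split with a fixed seed, averaged log probabilities, 0.5 threshold) can be sketched as follows. This is a minimal illustration, not the authors' code: the feature dimensions, random data, and scikit-learn `MLPClassifier` choice are assumptions; the paper's Appendix C.2 holds the real detector settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier  # MLP with Adam (solver="adam")

# Hypothetical stand-ins for the causality features LLMScan extracts;
# dimensions and data are illustrative only.
rng = np.random.default_rng(0)
n = 200
prompt_feats = rng.normal(size=(n, 16))   # causality distribution over prompt tokens
layer_feats = rng.normal(size=(n, 32))    # causality distribution over neural layers
labels = rng.integers(0, 2, size=n)       # 1 = misbehavior, 0 = benign

# 70% training / 30% testing with a fixed random seed, as in the paper
idx_train, idx_test = train_test_split(
    np.arange(n), test_size=0.3, random_state=42
)

clf_prompt = MLPClassifier(hidden_layer_sizes=(32,), solver="adam",
                           max_iter=500, random_state=42)
clf_layer = MLPClassifier(hidden_layer_sizes=(32,), solver="adam",
                          max_iter=500, random_state=42)
clf_prompt.fit(prompt_feats[idx_train], labels[idx_train])
clf_layer.fit(layer_feats[idx_train], labels[idx_train])

# Average the two classifiers' log probabilities, exponentiate, and
# classify as misbehavior when the class-1 probability exceeds 0.5.
avg_log_p = (clf_prompt.predict_log_proba(prompt_feats[idx_test])
             + clf_layer.predict_log_proba(layer_feats[idx_test])) / 2
final_prob = np.exp(avg_log_p)[:, 1]
preds = (final_prob > 0.5).astype(int)
```

Averaging log probabilities (rather than raw probabilities) amounts to taking a geometric mean of the two classifiers' confidences, which down-weights examples where either classifier is uncertain.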