Detecting Strategic Deception with Linear Probes
Authors: Nicholas Goldowsky-Dill, Bilal Chughtai, Stefan Heimersheim, Marius Hobbhahn
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test two probe-training datasets... We test whether these probes generalize to realistic settings where Llama-3.3-70B-Instruct behaves deceptively... We find that our probe distinguishes honest and deceptive responses with AUROCs between 0.96 and 0.999 on our evaluation datasets. If we set the decision threshold to have a 1% false positive rate on chat data not related to deception, our probe catches 95-99% of the deceptive responses. |
| Researcher Affiliation | Industry | 1Apollo Research, London, United Kingdom. Correspondence to: Nicholas Goldowsky-Dill <EMAIL>. |
| Pseudocode | No | The paper describes the methodology using prose and diagrams (e.g., Figure 1) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our probe outputs can be viewed at data.apolloresearch.ai/dd/ and our code at github.com/ApolloResearch/deceptiondetection. |
| Open Datasets | Yes | In our primary experiments we train a probe on a simple Instructed-Pairs dataset from Zou et al. (2023) and evaluate across several more complex evaluation datasets including roleplaying scenarios, two insider trading datasets (Scheurer et al., 2023), and a sandbagging dataset (Benton et al., 2024). We also test if probes trained on the roleplaying dataset can generalize to more realistic evaluation datasets. |
| Dataset Splits | Yes | We train a probe on 80% of the roleplaying dataset, and evaluate it on the other datasets (including a validation section of the roleplaying dataset). |
| Hardware Specification | No | The paper mentions using Llama-3.3-70B-Instruct as the model but does not specify the hardware (e.g., specific GPUs, CPUs) used for experiments. |
| Software Dependencies | No | The paper mentions using Llama-3.3-70B-Instruct (Grattafiori et al., 2024) and GPT-4o-2024-08-06 for grading, but does not provide specific version numbers for software libraries or dependencies like Python, PyTorch, or TensorFlow. |
| Experiment Setup | Yes | We train a logistic regression classifier with l2 regularization (λ = 10) on normalized activations (zero mean, unit variance). The classifier weights define our probe direction. When evaluating new samples, we apply the same affine transformation used to normalize the training set and project onto the probe direction to get a real-valued deception score at every token. |
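The probe-training recipe quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' code: the activation and label arrays are synthetic placeholders standing in for per-token residual-stream activations of Llama-3.3-70B-Instruct, and the gradient-descent fit is a stand-in for whatever solver the authors used.

```python
import numpy as np

# Synthetic stand-ins: X is (n_tokens, d_model) activations,
# y is 0 (honest) / 1 (deceptive) per token.
rng = np.random.default_rng(0)
d = 16
X = rng.normal(size=(400, d))
y = rng.integers(0, 2, size=400).astype(float)

# Normalize to zero mean, unit variance; statistics come from the
# training set only, and the same affine map is reused at eval time.
mu, sigma = X.mean(axis=0), X.std(axis=0)
Xn = (X - mu) / sigma

# Logistic regression with l2 regularization (lambda = 10),
# fit here by plain gradient descent.
lam, lr, w, b = 10.0, 0.1, np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(Xn @ w + b)))
    w -= lr * (Xn.T @ (p - y) + lam * w) / len(y)
    b -= lr * (p - y).mean()

# The classifier weights define the probe direction; projecting
# normalized activations onto it gives a per-token deception score.
probe = w
X_eval = rng.normal(size=(50, d))
scores = ((X_eval - mu) / sigma) @ probe
```

Note that the projection deliberately omits the bias term: only the direction matters, since the decision threshold is set separately on control data.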
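The evaluation protocol quoted in the Research Type row (a threshold set for a 1% false positive rate on control chat data, plus AUROC between honest and deceptive responses) can be sketched as below. The score distributions are synthetic placeholders; the AUROC is computed via the rank-statistic (Mann-Whitney) identity rather than a library call.

```python
import numpy as np

# Synthetic probe scores standing in for the paper's data:
# control chat data not related to deception, and deceptive responses.
rng = np.random.default_rng(1)
control_scores = rng.normal(0.0, 1.0, size=1000)
deceptive_scores = rng.normal(4.0, 1.0, size=200)

# Decision threshold = 99th percentile of control scores,
# i.e. a 1% false positive rate on the control distribution.
threshold = np.quantile(control_scores, 0.99)

# Fraction of deceptive responses caught at that threshold.
catch_rate = (deceptive_scores > threshold).mean()

# AUROC: probability a random deceptive score exceeds a random
# control score (Mann-Whitney U formulation).
auroc = (deceptive_scores[:, None] > control_scores[None, :]).mean()
```

With well-separated distributions like these, the catch rate and AUROC are both close to 1, mirroring the 95-99% recall and 0.96-0.999 AUROC figures reported in the quote.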