Detecting Strategic Deception with Linear Probes
Authors: Nicholas Goldowsky-Dill, Bilal Chughtai, Stefan Heimersheim, Marius Hobbhahn
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test two probe-training datasets... We test whether these probes generalize to realistic settings where Llama-3.3-70B-Instruct behaves deceptively... We find that our probe distinguishes honest and deceptive responses with AUROCs between 0.96 and 0.999 on our evaluation datasets. If we set the decision threshold to have a 1% false positive rate on chat data not related to deception, our probe catches 95-99% of the deceptive responses. |
| Researcher Affiliation | Industry | 1Apollo Research, London, United Kingdom. Correspondence to: Nicholas Goldowsky-Dill <EMAIL>. |
| Pseudocode | No | The paper describes the methodology using prose and diagrams (e.g., Figure 1) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our probe outputs can be viewed at data.apolloresearch.ai/dd/ and our code at github.com/ApolloResearch/deceptiondetection. |
| Open Datasets | Yes | In our primary experiments we train a probe on a simple Instructed-Pairs dataset from Zou et al. (2023) and evaluate across several more complex evaluation datasets including roleplaying scenarios, two insider trading datasets (Scheurer et al., 2023), and a sandbagging dataset (Benton et al., 2024). We also test if probes trained on the roleplaying dataset can generalize to more realistic evaluation datasets. |
| Dataset Splits | Yes | We train a probe on 80% of the roleplaying dataset, and evaluate it on the other datasets (including a validation section of the roleplaying dataset). |
| Hardware Specification | No | The paper mentions using Llama-3.3-70B-Instruct as the model but does not specify the hardware (e.g., specific GPUs, CPUs) used for experiments. |
| Software Dependencies | No | The paper mentions using Llama-3.3-70B-Instruct (Grattafiori et al., 2024) and GPT-4o-2024-08-06 for grading, but does not provide specific version numbers for software libraries or dependencies like Python, PyTorch, or TensorFlow. |
| Experiment Setup | Yes | We train a logistic regression classifier with l2 regularization (λ = 10) on normalized activations (zero mean, unit variance). The classifier weights define our probe direction. When evaluating new samples, we apply the same affine transformation used to normalize the training set and project onto the probe direction to get a real-valued deception score at every token. |
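The probe-training recipe quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' code: the activation and label arrays are synthetic placeholders standing in for per-token residual-stream activations of Llama-3.3-70B-Instruct, and the gradient-descent fit is a stand-in for whatever solver the authors used.

```python
import numpy as np

# Synthetic stand-ins: X is (n_tokens, d_model) activations,
# y is 0 (honest) / 1 (deceptive) per token.
rng = np.random.default_rng(0)
d = 16
X = rng.normal(size=(400, d))
y = rng.integers(0, 2, size=400).astype(float)

# Normalize to zero mean, unit variance; statistics come from the
# training set only, and the same affine map is reused at eval time.
mu, sigma = X.mean(axis=0), X.std(axis=0)
Xn = (X - mu) / sigma

# Logistic regression with l2 regularization (lambda = 10),
# fit here by plain gradient descent.
lam, lr, w, b = 10.0, 0.1, np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(Xn @ w + b)))
    w -= lr * (Xn.T @ (p - y) + lam * w) / len(y)
    b -= lr * (p - y).mean()

# The classifier weights define the probe direction; projecting
# normalized activations onto it gives a per-token deception score.
probe = w
X_eval = rng.normal(size=(50, d))
scores = ((X_eval - mu) / sigma) @ probe
```

Note that the projection deliberately omits the bias term: only the direction matters, since the decision threshold is set separately on control data.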
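The evaluation protocol quoted in the Research Type row (a threshold set for a 1% false positive rate on control chat data, plus AUROC between honest and deceptive responses) can be sketched as below. The score distributions are synthetic placeholders; the AUROC is computed via the rank-statistic (Mann-Whitney) identity rather than a library call.

```python
import numpy as np

# Synthetic probe scores standing in for the paper's data:
# control chat data not related to deception, and deceptive responses.
rng = np.random.default_rng(1)
control_scores = rng.normal(0.0, 1.0, size=1000)
deceptive_scores = rng.normal(4.0, 1.0, size=200)

# Decision threshold = 99th percentile of control scores,
# i.e. a 1% false positive rate on the control distribution.
threshold = np.quantile(control_scores, 0.99)

# Fraction of deceptive responses caught at that threshold.
catch_rate = (deceptive_scores > threshold).mean()

# AUROC: probability a random deceptive score exceeds a random
# control score (Mann-Whitney U formulation).
auroc = (deceptive_scores[:, None] > control_scores[None, :]).mean()
```

With well-separated distributions like these, the catch rate and AUROC are both close to 1, mirroring the 95-99% recall and 0.96-0.999 AUROC figures reported in the quote.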