Monitoring Latent World States in Language Models with Propositional Probes
Authors: Jiahai Feng, Stuart Russell, Jacob Steinhardt
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we introduce propositional probes, which extract symbolic latent world states as logical propositions; we then show that these latent world states remain faithful in three adversarial settings. [...] We find that propositional probes generalize to these complex settings, achieving a Jaccard index within 10% of a prompting skyline (Sec. 6.1). Further, in three adversarial settings (prompt injections, backdoors, and gender bias), our probes produce more faithful propositions than what the model outputs (Sec. 6.2). Table 1: Exact-match accuracy (EM) and Jaccard index of propositional probes and the prompting skyline on standard (Sec. 6.1) and adversarial settings (Sec. 6.2). |
| Researcher Affiliation | Academia | Jiahai Feng, Stuart Russell & Jacob Steinhardt, UC Berkeley. Correspondence to EMAIL |
| Pseudocode | Yes | Algorithm 1 Lookup algorithm to propose predicates. Require: Domain probes {Pk}k and binding similarity metric d(·,·). 1: procedure PROPOSEPREDICATES({Zs}s) 2: N ← subset of {Zs}s for which P0 is not ⊥ ▷ Detect names 3: for all Pk, k > 0 do 4: V ← subset of {Zs}s for which Pk is not ⊥ 5: for all Zv in V do 6: n ← arg max over i in N of d(Zi, Zv) ▷ Find best matching name 7: Propose new proposition (n, v) 8: end for 9: end for 10: return all proposed propositions 11: end procedure |
| Open Source Code | Yes | We make code available at https://github.com/jiahai-feng/prop-probes-iclr |
| Open Datasets | No | We create three datasets of natural language passages of increasing complexity, namely SYNTH, PARA, and TRANS, each labelled with ground-truth propositions. To do so, we first generate random propositions, which are then formatted in a template to produce the SYNTH dataset, and augmented using GPT-3.5-turbo to produce diverse and complex PARA and TRANS datasets (Fig. 2). The paper describes the creation process of custom datasets (SYNTH, PARA, TRANS) but does not provide specific access information like a link, DOI, or repository for the generated data itself. |
| Dataset Splits | Yes | Specifically, we create three versions of datasets; the propositional probes are trained only using the simplest version (SYNTH), and evaluated on the heldout versions (PARA, TRANS). [...] The SYNTH dataset is constructed by populating a simple template with 512 random draws from the four domains. A validation set is created with another 512 random draws, which was used to select thresholds for domain probes. |
| Hardware Specification | Yes | All of our experiments are conducted on an internal GPU cluster. All experiments require at most 4 A100 80GB GPUs. |
| Software Dependencies | No | We use the Hugging Face Transformers implementation (Wolf et al., 2019) as well as the TransformerLens library to run the Tulu-2-13B (Ivison et al., 2023) and Llama-2-13B-chat (Touvron et al., 2023) models. [...] We use the Adam optimizer (Kingma & Ba, 2014), with learning rate 0.001, with a linear schedule over 5 epochs (with the first 0.1 steps as warmup), over a dataset of 128 samples, with batch size 8. The paper mentions software tools and libraries (Hugging Face Transformers, TransformerLens, PyTorch) but does not provide specific version numbers for them. |
| Experiment Setup | Yes | We use the Adam optimizer (Kingma & Ba, 2014), with learning rate 0.001, with a linear schedule over 5 epochs (with the first 0.1 steps as warmup), over a dataset of 128 samples, with batch size 8. |
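The lookup procedure reproduced in the Pseudocode row can be sketched in Python. This is an illustrative reading, not the authors' code: `propose_predicates`, the probe callables, and `similarity` are hypothetical names, and each domain probe is assumed to return `None` (standing in for ⊥) when an activation is outside its domain.

```python
# Sketch of Algorithm 1 (lookup over domain probes); names are illustrative.
from typing import Any, Callable, Optional, Sequence

def propose_predicates(
    activations: Sequence[Any],
    domain_probes: Sequence[Callable[[Any], Optional[str]]],
    similarity: Callable[[Any, Any], float],
) -> list:
    """Pair each non-name attribute with its best-matching name activation.

    domain_probes[0] is assumed to be the name-domain probe P0; a probe
    returns None when the activation lies outside its domain.
    """
    # Detect names: activations on which the name probe P0 fires.
    names = [z for z in activations if domain_probes[0](z) is not None]
    propositions = []
    for probe in domain_probes[1:]:
        # Activations belonging to this non-name domain Pk.
        values = [z for z in activations if probe(z) is not None]
        for zv in values:
            # Best-matching name under the binding similarity metric d.
            zn = max(names, key=lambda zi: similarity(zi, zv))
            propositions.append((domain_probes[0](zn), probe(zv)))
    return propositions
```

With toy activations tagged by domain and position, and similarity defined as negative positional distance, each attribute binds to the nearest detected name.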
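The training hyperparameters quoted above (Adam, learning rate 0.001, linear schedule over 5 epochs with the first 10% of steps as warmup, 128 samples, batch size 8) imply 16 steps per epoch and 80 steps total, of which the first 8 are warmup. A minimal sketch of that learning-rate schedule, assuming "0.1 steps" means a 10% warmup fraction and a linear decay to zero afterwards:

```python
def linear_warmup_schedule(step: int, total_steps: int,
                           warmup_frac: float = 0.1,
                           base_lr: float = 1e-3) -> float:
    """Linear warmup to base_lr, then linear decay to zero."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup steps.
        return base_lr * step / warmup_steps
    # Decay linearly from base_lr at the end of warmup to 0 at total_steps.
    return base_lr * (total_steps - step) / (total_steps - warmup_steps)

# Paper setup: 128 samples / batch size 8 = 16 steps per epoch, 5 epochs.
TOTAL_STEPS = (128 // 8) * 5  # 80 optimizer steps, first 8 are warmup
```

In practice one would pass a function like this to `torch.optim.lr_scheduler.LambdaLR` around a `torch.optim.Adam` optimizer; the pure-Python form here just makes the schedule explicit.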