Monitoring Latent World States in Language Models with Propositional Probes
Authors: Jiahai Feng, Stuart Russell, Jacob Steinhardt
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we introduce propositional probes, which extract symbolic latent world states as logical propositions; we then show that these latent world states remain faithful in three adversarial settings. [...] We find that propositional probes generalize to these complex settings, achieving a Jaccard index within 10% of a prompting skyline (Sec. 6.1). Further, in three adversarial settings (prompt injections, backdoors, and gender bias), our probes produce more faithful propositions than what the model outputs (Sec. 6.2). Table 1: Exact-match accuracy (EM) and Jaccard index of propositional probes and the prompting skyline on standard (Sec. 6.1) and adversarial settings (Sec. 6.2). |
| Researcher Affiliation | Academia | Jiahai Feng, Stuart Russell & Jacob Steinhardt, UC Berkeley. Correspondence to EMAIL |
| Pseudocode | Yes | Algorithm 1 Lookup algorithm to propose predicates. Require: Domain probes {Pk}k and binding similarity metric d(·,·). 1: procedure PROPOSEPREDICATES({Zs}s) 2: N ← subset of {Zs}s for which P0 is not ⊥ ▷ Detect names 3: for all Pk, k > 0 do 4: V ← subset of {Zs}s for which Pk is not ⊥ 5: for all Zv in V do 6: n ← arg max over i in N of d(Zi, Zv) ▷ Find best matching name 7: Propose new proposition (n, v) 8: end for 9: end for 10: return all proposed propositions 11: end procedure |
| Open Source Code | Yes | We make code available at https://github.com/jiahai-feng/prop-probes-iclr |
| Open Datasets | No | We create three datasets of natural language passages of increasing complexity, namely SYNTH, PARA, and TRANS, each labelled with ground-truth propositions. To do so, we first generate random propositions, which are then formatted in a template to produce the SYNTH dataset, and augmented using GPT-3.5-turbo to produce diverse and complex PARA and TRANS datasets (Fig. 2). The paper describes the creation process of custom datasets (SYNTH, PARA, TRANS) but does not provide specific access information like a link, DOI, or repository for the generated data itself. |
| Dataset Splits | Yes | Specifically, we create three versions of datasets; the propositional probes are trained only using the simplest version (SYNTH), and evaluated on the heldout versions (PARA, TRANS). [...] The SYNTH dataset is constructed by populating a simple template with 512 random draws from the four domains. A validation set is created with another 512 random draws, which was used to select thresholds for domain probes. |
| Hardware Specification | Yes | All of our experiments are conducted on an internal GPU cluster. All experiments require at most 4 A100 80GB GPUs. |
| Software Dependencies | No | We use the Hugging Face Transformers implementation (Wolf et al., 2019) as well as the TransformerLens library to run the Tulu-2-13B (Ivison et al., 2023) and Llama-2-13B-chat (Touvron et al., 2023) models. [...] We use the Adam optimizer (Kingma & Ba, 2014), with learning rate 0.001, with a linear schedule over 5 epochs (with the first 0.1 steps as warmup), over a dataset of 128 samples, with batch size 8. The paper mentions software tools and libraries (Hugging Face Transformers, TransformerLens, PyTorch) but does not provide specific version numbers for them. |
| Experiment Setup | Yes | We use the Adam optimizer (Kingma & Ba, 2014), with learning rate 0.001, with a linear schedule over 5 epochs (with the first 0.1 steps as warmup), over a dataset of 128 samples, with batch size 8. |
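The lookup procedure reproduced in the Pseudocode row can be sketched in Python. This is an illustrative reading, not the authors' code: `propose_predicates`, the probe callables, and `similarity` are hypothetical names, and each domain probe is assumed to return `None` (standing in for ⊥) when an activation is outside its domain.

```python
# Sketch of Algorithm 1 (lookup over domain probes); names are illustrative.
from typing import Any, Callable, Optional, Sequence

def propose_predicates(
    activations: Sequence[Any],
    domain_probes: Sequence[Callable[[Any], Optional[str]]],
    similarity: Callable[[Any, Any], float],
) -> list:
    """Pair each non-name attribute with its best-matching name activation.

    domain_probes[0] is assumed to be the name-domain probe P0; a probe
    returns None when the activation lies outside its domain.
    """
    # Detect names: activations on which the name probe P0 fires.
    names = [z for z in activations if domain_probes[0](z) is not None]
    propositions = []
    for probe in domain_probes[1:]:
        # Activations belonging to this non-name domain Pk.
        values = [z for z in activations if probe(z) is not None]
        for zv in values:
            # Best-matching name under the binding similarity metric d.
            zn = max(names, key=lambda zi: similarity(zi, zv))
            propositions.append((domain_probes[0](zn), probe(zv)))
    return propositions
```

With toy activations tagged by domain and position, and similarity defined as negative positional distance, each attribute binds to the nearest detected name.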
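The training hyperparameters quoted above (Adam, learning rate 0.001, linear schedule over 5 epochs with the first 10% of steps as warmup, 128 samples, batch size 8) imply 16 steps per epoch and 80 steps total, of which the first 8 are warmup. A minimal sketch of that learning-rate schedule, assuming "0.1 steps" means a 10% warmup fraction and a linear decay to zero afterwards:

```python
def linear_warmup_schedule(step: int, total_steps: int,
                           warmup_frac: float = 0.1,
                           base_lr: float = 1e-3) -> float:
    """Linear warmup to base_lr, then linear decay to zero."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup steps.
        return base_lr * step / warmup_steps
    # Decay linearly from base_lr at the end of warmup to 0 at total_steps.
    return base_lr * (total_steps - step) / (total_steps - warmup_steps)

# Paper setup: 128 samples / batch size 8 = 16 steps per epoch, 5 epochs.
TOTAL_STEPS = (128 // 8) * 5  # 80 optimizer steps, first 8 are warmup
```

In practice one would pass a function like this to `torch.optim.lr_scheduler.LambdaLR` around a `torch.optim.Adam` optimizer; the pure-Python form here just makes the schedule explicit.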