Eliciting Language Model Behaviors with Investigator Agents
Authors: Xiang Lisa Li, Neil Chowdhury, Daniel D. Johnson, Tatsunori Hashimoto, Percy Liang, Sarah Schwettmann, Jacob Steinhardt
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We next evaluate our elicitation framework empirically. First, we conduct a detailed analysis of all parts of our pipeline (SFT, DPO, Frank-Wolfe) on a dataset with known gold prefixes for comparison. Next, we consider the task of automated jailbreaking to elicit harmful strings and behaviors, which is a special case of behavior elicitation where we can compare to existing methods. Finally, we demonstrate the flexibility of our framework by applying rubric elicitation to uncover both hallucinations and aberrant behaviors. Table 1 summarizes each of these tasks along with examples. Our investigators achieve near-perfect attack success rate (ASR) for both strings and behaviors, significantly outperforming GCG (Zou et al., 2023). |
| Researcher Affiliation | Collaboration | 1Stanford University 2MIT 3Transluce 4UC Berkeley. Correspondence to: Xiang Lisa Li <EMAIL>, Neil Chowdhury <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 FW(p_θ^SFT, P_RL, p_m, β_1, β_2, η_i, T_FW, T_DPO); Algorithm 2 SFT-S1(q, p_m, P_SFT, p_v, S, N); Algorithm 3 Two-Stage Rubric-Based Elicitation; Algorithm 4 SFT(p_θ, p_m, P_SFT, N); Algorithm 5 DPO(p_θ, P_RL, r, β, T_DPO) |
| Open Source Code | No | No explicit statement or link providing code for the methodology described in this paper was found. |
| Open Datasets | Yes | On AdvBench (Harmful Behaviors), we obtain 100% attack success rate against Llama-3.1 8B and 98% against Llama-3.3 70B Turbo, which significantly outperforms GCG (Zou et al., 2023). We use WildChat for the majority of our experiments, since it contains diverse, natural prompts; however, it might also contain instances of human-constructed attacks that could aid an investigator in attacking a target LM. To control for this, we also report results when using UltraChat (Ding et al., 2023), an entirely synthetically generated instruction-tuning dataset, for supervised fine-tuning (see Appendix F.1.2 for details). Factual inaccuracies are a prevalent problem in contemporary language models (Wang et al., 2024). We seek to automatically uncover these inaccuracies by eliciting common misconceptions sourced from the TruthfulQA dataset (Lin et al., 2022). Specifically, we parse the DSM-5 textbook (APA, 2013), the standard diagnostic manual of the APA, using GPT-4o to extract all descriptions of aberrant behaviors. We construct gold pairs (x, y) as follows: take p_m = Llama-3.1 8B and sample 64-token prefixes x from FineWeb (Penedo et al., 2024), then greedily decode from the target language model to obtain y. |
| Dataset Splits | Yes | We construct gold pairs (x, y) as follows: take p_m = Llama-3.1 8B and sample 64-token prefixes x from FineWeb (Penedo et al., 2024), then greedily decode from the target language model to obtain y. We use this to obtain a test set of 4096 gold-pair pretraining strings. To train our investigator, we finetune Llama-3.1 8B using SFT on 1,000,000 FineWeb pairs (disjoint from D_RL), followed by DPO (T_DPO = 1, k = 5) and FW (T_FW = 4, β_1 = 0.6, β_2 = 0.1) on D_RL, where |D_RL| = 25,000. |
| Hardware Specification | Yes | All experiments were performed on a single 8x A100 or 8x H100 node. |
| Software Dependencies | No | The paper mentions using the TRL (Transformer Reinforcement Learning) library and the Python text-dedup library, but does not provide version numbers for these or any other key software dependencies. |
| Experiment Setup | Yes | For Llama-3.1 8B, we use a batch size of 4, and for Llama-3.2 1B, we use a batch size of 16. We use a hyperparameter sweep to determine the optimal learning rate; for Llama-3.1 8B, it is 1×10^-5, and for Llama-3.2 1B, it is 5×10^-5. We train for a single epoch of either WildChat or UltraChat, and use Fully Sharded Data Parallel training. We use the DPOTrainer from TRL (von Werra et al., 2020) with a cosine learning rate schedule with a warmup ratio of 0.03. We use a batch size of 3 and a learning rate of 1×10^-6. When sampling from the trained DPO models, we use T = 0.8 and top-p = 0.9; we find that this decreases the probability of sampling unwanted characters during DPO training. We sample a maximum of 128 tokens from the investigator model for both training and evaluation. For pretrained string elicitation and Harmful Strings, we set λ = 0.5, and for Harmful Responses, we set λ = 2.0. For the hallucination and persona settings, we set λ = 0.1. |
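The gold-pair construction quoted under Dataset Splits (a 64-token FineWeb prefix x, followed by a greedily decoded continuation y from the target model) can be sketched abstractly. In this sketch the target LM is stubbed by any function mapping a token sequence to next-token log-probabilities; all function and variable names are illustrative, not from the paper's code.

```python
def greedy_decode(lm, prefix, max_new_tokens):
    """Greedy decoding: repeatedly append the argmax next token.

    `lm` stands in for the target model p_m: it maps a token sequence
    to a dict of {token: log-probability} for the next position.
    """
    seq = list(prefix)
    for _ in range(max_new_tokens):
        dist = lm(seq)
        seq.append(max(dist, key=dist.get))
    return seq

def make_gold_pair(corpus_tokens, lm, prefix_len=64, target_len=128):
    """Truncate a corpus document to a prefix x, then greedily decode y."""
    x = corpus_tokens[:prefix_len]
    y = greedy_decode(lm, x, target_len)[len(x):]
    return x, y
```

Repeating this over sampled FineWeb documents yields the kind of (x, y) test set of gold-pair pretraining strings described above.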
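Since the pipeline trains the investigator with TRL's DPOTrainer, a minimal dependency-free sketch of the standard DPO objective (Rafailov et al., 2023) may clarify what each trainer step optimizes. This is the generic formulation, not the paper's exact implementation; `dpo_loss` and its argument names are illustrative.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * log-ratio margin).

    The policy is rewarded for raising the log-probability of the chosen
    (preferred) response, relative to the frozen reference model, by more
    than it raises that of the rejected response.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At initialization (policy equals reference) the margin is zero and the loss is log 2; shifting probability mass toward the chosen response drives it down.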
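The sampling settings quoted in Experiment Setup (T = 0.8, top-p = 0.9) correspond to temperature scaling of the logits followed by nucleus (top-p) filtering of the resulting distribution. A small self-contained sketch, with illustrative function names:

```python
import math

def apply_temperature(logits, temperature=0.8):
    """Softmax over logits scaled by 1/T; T < 1 sharpens the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def top_p_filter(probs, top_p=0.9):
    """Keep the smallest high-probability set whose mass reaches top_p,
    zero out the tail, and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = set(), 0.0
    for i in order:
        kept.add(i)
        cum += probs[i]
        if cum >= top_p:
            break
    filtered = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    z = sum(filtered)
    return [p / z for p in filtered]
```

Truncating the low-probability tail in this way is consistent with the paper's observation that it reduces the chance of sampling unwanted characters during DPO training.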