Tell me about yourself: LLMs are aware of their learned behaviors
Authors: Jan Betley, Xuchan Bao, Martín Soto, Anna Sztyber-Betley, James Chua, Owain Evans
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally investigate the behavioral self-awareness of LLMs. We finetune chat LLMs on one or multiple policies... We then evaluate models' ability to describe these policies through a diverse range of questions, testing if they can accurately attribute policies to different characters without confusion (Section 4.3). |
| Researcher Affiliation | Collaboration | Affiliations: Truthful AI, University of Toronto, UK AISI, Warsaw University of Technology, UC Berkeley. These include both academic institutions (University of Toronto, Warsaw University of Technology, UC Berkeley) and non-academic research organizations (Truthful AI, UK AISI), indicating a collaboration. |
| Pseudocode | No | The paper describes experimental procedures and shows generated Python functions in Section B.6 ('Examples of the generated Python functions when the target codeword is bark are shown below.'). However, it does not contain any structured pseudocode or algorithm blocks for its methodology. |
| Open Source Code | No | The paper mentions using the 'Fireworks finetuning API (Fireworks.ai, 2024)' and 'OpenAI's finetuning API (OpenAI, 2024b)' for model finetuning, which are third-party services. There is no explicit statement by the authors releasing their own implementation code, nor a link to a code repository for the methodology described in the paper. |
| Open Datasets | No | 'We create a dataset of examples that exhibit the latent policy... To this end, we use an LLM (GPT-4o) with few-shot prompting to generate 500 diverse multiple-choice questions in which one of the two options better fits the policy (Figure 1) and in which the relevant terms are avoided. A dataset for the opposite policy (e.g. risk-aversion) is created by simply flipping all the labels. Full details of the data generation process are in Appendix A.1.' 'To create the training dataset, we employ two language models: GPT-4o as the manipulator and GPT-4o-mini as the manipulatee (Appendix B.1).' The authors describe generating their own datasets using LLMs for finetuning. They do not provide a link, DOI, repository, or citation that would make these datasets publicly accessible, nor do they use a well-known public dataset. |
| Dataset Splits | No | 'The model is finetuned on a dataset D = {d_n}_{n=1}^N, where d_n ∼ φ_T(z). The data generating distribution φ_T is a function of the latent z, but does not contain explicit descriptions of z. For example, φ_T(z) generates multiple-choice Q&As that favor the riskier option, without explicitly mentioning risk-seeking behavior. After training, the model is tested on out-of-distribution evaluations Q = {q : q ∼ φ_E(z)}.' In Section 3.1.1, the authors state, 'We create a dataset of examples... to generate 500 diverse multiple-choice questions'. In Section 4.1: 'We used 1000 dialogues for the finetuning data for each codeword.' While the paper distinguishes between finetuning data and out-of-distribution evaluations, it does not describe splitting a single dataset into training, validation, and test sets, nor give specific percentages/counts for a reproducible data partition. |
| Hardware Specification | No | 'We finetune the GPT-4o and Llama-3.1-70B models on each of the two datasets. For the Llama-3.1-70B model (AI@Meta, 2024), we use Low-Rank Adaptation (LoRA) (Hu et al., 2021) with rank 4, using the Fireworks finetuning API (Fireworks.ai, 2024). For GPT-4o (OpenAI, 2024), we use OpenAI's finetuning API (OpenAI, 2024b).' The paper mentions the models and the APIs used for finetuning but does not specify the underlying hardware (e.g., GPU models, CPU types, memory) used to run the models or the finetuning processes. |
| Software Dependencies | No | 'For the Llama-3.1-70B model (AI@Meta, 2024), we use Low-Rank Adaptation (LoRA) (Hu et al., 2021) with rank 4, using the Fireworks finetuning API (Fireworks.ai, 2024). For GPT-4o (OpenAI, 2024), we use OpenAI's finetuning API (OpenAI, 2024b).' The paper mentions models, finetuning APIs, a technique (LoRA), and 'nltk.corpus' in the appendix. However, it lacks specific version numbers for programming languages, frameworks, or other libraries that would be needed for reproducible software dependencies. |
| Experiment Setup | Yes | Table 4: Finetuning details for the multiple-choice training experiments on GPT-4o, where # Runs is the number of repeated runs with different random seeds for each of the two training sets in each experiment (for example, 5 models finetuned on the risk-seeking dataset and 5 on the risk-averse one for the Risk experiment). Risk row: base model GPT-4o-2024-05-13, batch size 1, LR multiplier 2, epochs 3, 5 runs. Table 5: Finetuning details for the multiple-choice training experiments on Llama-3.1-70B (4 models finetuned on the risk-seeking dataset and 4 on the risk-averse one for the Risk experiment). Risk row: base model Llama-3.1-70B, batch size 8, LR 2e-5, epochs 3, LoRA rank 4, 4 runs. These tables explicitly provide concrete hyperparameter values such as batch size, learning rate, epochs, and LoRA rank for the finetuning experiments. |
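The label-flipping step quoted in the Open Datasets row (deriving, e.g., a risk-averse dataset from a risk-seeking one by flipping all labels) can be sketched in a few lines. This is an illustrative reconstruction, not the authors' released code; the record schema (`question`, `options`, `label`) is an assumption.

```python
def flip_policy_labels(dataset):
    """Return a copy of a two-option multiple-choice dataset with every
    label flipped, e.g. turning a risk-seeking training set into a
    risk-averse one. Schema is hypothetical: each example is a dict with
    'question', 'options' (length 2), and 'label' (0 or 1)."""
    flipped = []
    for example in dataset:
        flipped.append({
            "question": example["question"],
            "options": example["options"],
            # With exactly two options, flipping a label means
            # selecting the other option.
            "label": 1 - example["label"],
        })
    return flipped

# Toy example in the spirit of the paper's risk-preference questions.
risk_seeking = [{
    "question": "Which do you choose?",
    "options": ["A sure $50", "A 10% chance of $1000"],
    "label": 1,  # the riskier option
}]
risk_averse = flip_policy_labels(risk_seeking)
```

Flipping rather than regenerating keeps the two training sets identical except for the latent policy, which is what lets the paper attribute behavioral differences to that policy alone.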
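For quick reference, the hyperparameters reported in the Risk rows of Tables 4 and 5 can be collected into plain config dicts. The key names here are illustrative, not the schema of any finetuning API.

```python
# Risk-experiment finetuning settings as reported in Table 4 (GPT-4o)
# and Table 5 (Llama-3.1-70B) of the reviewed paper. Key names are
# this reviewer's own, chosen for readability.
gpt4o_risk_config = {
    "base_model": "GPT-4o-2024-05-13",
    "batch_size": 1,
    "lr_multiplier": 2,       # OpenAI finetuning uses an LR multiplier
    "epochs": 3,
    "runs_per_training_set": 5,
}

llama_risk_config = {
    "base_model": "Llama-3.1-70B",
    "batch_size": 8,
    "learning_rate": 2e-5,    # Fireworks reports an absolute LR
    "epochs": 3,
    "lora_rank": 4,
    "runs_per_training_set": 4,
}
```

Note the asymmetry: the GPT-4o runs specify a learning-rate multiplier (relative to an unstated base rate), while the Llama runs specify an absolute learning rate plus a LoRA rank, so the two setups are not directly comparable.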