Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct
Authors: Christopher Ackerman, Nina Panickssery
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our first experiment tested whether Llama3-8b-Instruct could achieve above-chance accuracy at self recognition in the Paired paradigm across a range of datasets. As shown in Figure 1a, the model can successfully distinguish its own output from that of humans in all four datasets. |
| Researcher Affiliation | Collaboration | Christopher Ackerman EMAIL Nina Panickssery EMAIL |
| Pseudocode | No | The paper describes methods textually, such as the contrastive pairs method, but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements about releasing code or links to a code repository for the methodology described. |
| Open Datasets | Yes | The Summarization paradigm employed three datasets: CNN-Dailymail (CNN; Hermann et al. (2015)), Extreme Summarization (XSUM; Narayan et al. (2018)), and Databricks Dolly (DOLLY; Conover et al. (2023)). The Situational Awareness Dataset (SAD; Laine et al. (2024)) utilized in the Continuation paradigm consists of a compilation of texts extracted from The EU AI Act, Reddit, and other sources. ... In addition to the test set derived from the datasets described above, we employ a novel test set based on a Quora dataset of question and answer pairs (QA; Datasets, 2021). |
| Dataset Splits | No | In the results below, we use 1000 texts from each of the CNN, XSUM, and SAD datasets, and 1188 from the DOLLY dataset. ... To form the contrast vector, we identified 734 pairs of model and human-written texts from across the four datasets on which the model had given highly confident and correct self and other authorship judgments in the Individual presentation paradigm. |
| Hardware Specification | No | The paper mentions running experiments and accessing model activations and parameters but does not specify any particular hardware like GPU or CPU models, or cloud computing resources used for the experiments. |
| Software Dependencies | No | The paper mentions using specific models like Llama3-8b, GPT3.5, GPT4, and Claude 2 but does not provide details on the software environment or library versions used for their implementation or experiments. |
| Experiment Setup | No | The paper mentions 'Steering with multipliers in the 3 to 6 range on layers 14-16 was most effective' and 'A small amount of prompt engineering was used', but it does not provide specific hyperparameters such as learning rates, batch sizes, number of epochs, or optimizer settings for model training or fine-tuning. |
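The contrastive-pairs steering the paper describes (a contrast vector built from 734 confidently judged self/other text pairs, added to middle-layer activations with multipliers of roughly 3 to 6) can be illustrated with a minimal NumPy sketch. This is an assumption-laden toy, not the authors' implementation: the function names, the tiny hidden size, and the synthetic activations are all hypothetical, and a real run would hook Llama3-8b-Instruct's residual stream at layers 14-16.

```python
import numpy as np

def contrast_vector(self_acts: np.ndarray, other_acts: np.ndarray) -> np.ndarray:
    """Mean difference between layer activations on self- vs. human-written texts."""
    return self_acts.mean(axis=0) - other_acts.mean(axis=0)

def steer(hidden: np.ndarray, vec: np.ndarray, multiplier: float) -> np.ndarray:
    """Shift a layer's hidden states along the contrast direction."""
    return hidden + multiplier * vec

# Toy stand-ins: 734 activation pairs (as in the paper), hidden size 16
# (Llama3-8b's actual hidden size is 4096).
rng = np.random.default_rng(0)
self_acts = rng.normal(loc=1.0, size=(734, 16))
other_acts = rng.normal(loc=-1.0, size=(734, 16))

vec = contrast_vector(self_acts, other_acts)
hidden = rng.normal(size=(5, 16))          # hidden states for 5 tokens
steered = steer(hidden, vec, multiplier=4.0)  # multiplier in the paper's 3-6 range
```

In practice the addition would happen inside the model's forward pass (e.g. via a forward hook on the chosen layers), with positive multipliers pushing activations toward the "self-authored" direction and negative ones away from it.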