Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct

Authors: Christopher Ackerman, Nina Panickssery

ICLR 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "Our first experiment tested whether Llama3-8b-Instruct could achieve above-chance accuracy at self recognition in the Paired paradigm across a range of datasets. As shown in Figure 1a, the model can successfully distinguish its own output from that of humans in all four datasets." |
| Researcher Affiliation | Collaboration | Christopher Ackerman EMAIL, Nina Panickssery EMAIL |
| Pseudocode | No | The paper describes methods textually, such as the contrastive pairs method, but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements about releasing code, or links to a code repository for the methodology described. |
| Open Datasets | Yes | "The Summarization paradigm employed three datasets: CNN-Dailymail (CNN; Hermann et al. (2015)), Extreme Summarization (XSUM; Narayan et al. (2018)), and Databricks Dolly (DOLLY; Conover et al. (2023)). The Situational Awareness Dataset (SAD; Laine et al. (2024)) utilized in the Continuation paradigm consists of a compilation of texts extracted from The EU AI Act, Reddit, and other sources. ... In addition to the test set derived from the datasets described above, we employ a novel test set based on a Quora dataset of question and answer pairs (QA; (Datasets, 2021))." |
| Dataset Splits | No | "In the results below, we use 1000 texts from each of the CNN, XSUM, and SAD datasets, and 1188 from the DOLLY dataset. ... To form the contrast vector, we identified 734 pairs of model and human-written texts from across the four datasets on which the model had given highly confident and correct self and other authorship judgments in the Individual presentation paradigm." |
| Hardware Specification | No | The paper mentions running experiments and accessing model activations and parameters, but does not specify any particular hardware such as GPU or CPU models, or cloud computing resources used for the experiments. |
| Software Dependencies | No | The paper mentions using specific models (Llama3-8b, GPT3.5, GPT4, and Claude 2) but does not provide details on the software environment or library versions used for its implementation or experiments. |
| Experiment Setup | No | The paper mentions "Steering with multipliers in the 3 to 6 range on layers 14-16 was most effective" and "A small amount of prompt engineering was used", but it does not provide specific hyperparameters such as learning rates, batch sizes, number of epochs, or optimizer settings for model training or fine-tuning. |
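The steering procedure the table quotes (a contrast vector formed from confident contrastive pairs, then added with a multiplier to mid-layer activations) can be sketched in miniature. This is an illustrative toy, not the paper's implementation: the `ToyBlock` module, hidden size, and random activations are stand-ins for Llama3-8b-Instruct's residual stream at layers 14-16, and the 734-pair count and the multiplier of 4 are taken from the figures quoted above.

```python
import torch

HIDDEN = 8  # toy hidden size; Llama3-8b uses 4096


def contrast_vector(self_acts: torch.Tensor, other_acts: torch.Tensor) -> torch.Tensor:
    """Mean-difference ("contrastive pairs") vector: average activation on
    model-written texts minus average on human-written texts."""
    return self_acts.mean(dim=0) - other_acts.mean(dim=0)


class ToyBlock(torch.nn.Module):
    """Stand-in for one transformer layer; steering adds a fixed vector
    to its output activations."""

    def __init__(self) -> None:
        super().__init__()
        self.linear = torch.nn.Linear(HIDDEN, HIDDEN)
        self.steer: torch.Tensor | None = None  # set to multiplier * vector

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.linear(x)
        if self.steer is not None:
            h = h + self.steer  # activation addition
        return h


torch.manual_seed(0)
# Simulated activations for the 734 confident pairs mentioned in the paper.
self_acts = torch.randn(734, HIDDEN) + 1.0
other_acts = torch.randn(734, HIDDEN) - 1.0
v = contrast_vector(self_acts, other_acts)

block = ToyBlock()
x = torch.randn(1, HIDDEN)
baseline = block(x)
block.steer = 4.0 * v  # multiplier in the reported 3-6 range
steered = block(x)
shift = steered - baseline  # equals 4.0 * v by construction
```

In a real reproduction the same addition would be applied via forward hooks on the chosen Llama layers, with activations collected under the paper's Individual presentation paradigm rather than sampled at random.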