Scaling Laws for Adversarial Attacks on Language Model Activations and Tokens

Authors: Stanislav Fort

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type: Experimental. We empirically verify a scaling law in which the maximum number of target tokens predicted, t_max, depends linearly on the number of tokens a whose activations the attacker controls, t_max = κa. We ran adversarial attacks on model activations immediately after the embedding layer for a suite of models, a range of attack lengths a and target token lengths t, and multiple repetitions of each experimental setup (with different random context tokens S and target tokens T each time), obtaining an empirical probability of attack success p(a, t) for each setting. Figure 3 shows an example of the results of our experiments on EleutherAI/pythia-1.4b-v0.
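Given empirical (a, t_max) measurements, the slope κ of the reported scaling law can be estimated with a least-squares fit through the origin. A minimal sketch follows; the data points are hypothetical placeholders, not the paper's measurements.

```python
# Hypothetical (a, t_max) pairs standing in for the paper's measurements:
# a = number of attacked activation tokens, t_max = longest target
# sequence reproduced with high success probability.
data = [(1, 9), (2, 21), (4, 39), (8, 81), (16, 159)]

def fit_kappa(pairs):
    """Least-squares slope of the through-the-origin line t_max = kappa * a."""
    num = sum(a * t for a, t in pairs)
    den = sum(a * a for a, _ in pairs)
    return num / den

kappa = fit_kappa(data)
print(f"kappa ~ {kappa:.2f}")
```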
Researcher Affiliation: Academia. Stanislav Fort, Independent Researcher, Prague, Czech Republic. Given that the author is listed as an "Independent Researcher" and the paper is published at an academic conference (ICLR), this affiliation aligns more closely with academia than with corporate industry, even without a formal university affiliation.
Pseudocode: Yes. The algorithms are shown in Figure 8: Algorithm 1, computing the loss for an activation attack P towards a t-token target sequence, and Algorithm 2, a greedy, exhaustive token attack towards a t-token target sequence.
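A minimal sketch of the loss described for Algorithm 1, assuming a teacher-forced cross-entropy over the t target tokens. Here `model_logits_fn` is a hypothetical stand-in for a forward pass that consumes the attacker-controlled post-embedding activations; it is not the paper's implementation.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def activation_attack_loss(model_logits_fn, attack_activations, target_tokens):
    """Mean teacher-forced cross-entropy of the target tokens when the first
    a post-embedding activations are replaced by attack_activations."""
    # One logit vector per target position, conditioned on the attacked prefix.
    logits_per_pos = model_logits_fn(attack_activations, target_tokens)
    nll = 0.0
    for logits, tok in zip(logits_per_pos, target_tokens):
        nll -= math.log(softmax(logits)[tok])
    return nll / len(target_tokens)

# Toy "model": uniform logits over a 4-token vocabulary at every position.
uniform_model = lambda acts, targets: [[0.0, 0.0, 0.0, 0.0] for _ in targets]
loss = activation_attack_loss(uniform_model,
                              attack_activations=[[0.1, -0.2]],
                              target_tokens=[2, 0, 3])
```

With uniform logits over V = 4 tokens, every position contributes -log(1/4), so the mean loss is log(4).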
Open Source Code: No. The paper mentions using open-source models such as the "EleutherAI/pythia series of Large Language Models... from Hugging Face", microsoft/phi-1, and roneneldan/TinyStories, which are publicly available. However, no explicit statement or link is provided for the source code of the specific methodology developed and implemented in this paper.
Open Datasets: Yes. We have been using the EleutherAI/pythia series of Large Language Models (Biderman et al., 2023) based on the GPT-NeoX library (Andonian et al., 2021; Black et al., 2022) from Hugging Face. A second suite of models we used is microsoft/phi-1 (Li et al., 2023). Finally, we used a single checkpoint of roneneldan/TinyStories presented in Eldan & Li (2023).
Dataset Splits: No. The paper states: "We use random tokens sampled uniformly both for the context S as well as the targets T to ensure fairness." and "For each fixed (a, t), we repeat an experiment where we generate random context tokens S, and random target tokens T...". This describes generating fresh random data for each experiment run rather than fixed training/validation/test splits of a pre-existing dataset.
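A sketch of the uniform random sampling described above; the sequence lengths are illustrative, and the vocabulary size is taken from the paper's reported V ~ 50,000.

```python
import random

V = 50_000  # approximate vocabulary size reported in the paper

def sample_tokens(n, vocab_size=V, seed=None):
    """Draw n token ids uniformly at random, as done for the context S
    and the target T in each repetition of an (a, t) experiment."""
    rng = random.Random(seed)
    return [rng.randrange(vocab_size) for _ in range(n)]

context_S = sample_tokens(64, seed=0)  # fresh random context each run
target_T = sample_tokens(10, seed=1)   # fresh random target each run
```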
Hardware Specification: Yes. We ran our experiments on a single A100 GPU on Google Colab.
Software Dependencies: No. The paper mentions software components such as the EleutherAI/pythia series of models, the GPT-NeoX library, and the Adam optimizer, but it does not provide specific version numbers for these or other key software dependencies (e.g., Python, PyTorch).
Experiment Setup: Yes. For finding the adversarial attacks on activations, we used the Adam optimizer (Kingma & Ba, 2017) at a learning rate of 10^-1 for 300 optimization steps, unless explicitly stated otherwise. Our activations were all in the float16 format, and the model vocabulary sizes were all very close to V ~ 50,000.
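To illustrate the reported optimizer configuration (Adam, learning rate 10^-1, 300 steps), here is a minimal from-scratch Adam loop on a stand-in scalar loss. In the paper, the optimized variable is the block of attacked activations and the loss is the cross-entropy of the target tokens; the quadratic loss below is purely illustrative.

```python
import math

def adam_minimize(grad_fn, x0, lr=0.1, steps=300,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Plain Adam (Kingma & Ba, 2017) on a single scalar parameter."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g          # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Stand-in loss L(x) = (x - 3)^2, gradient 2 (x - 3); the paper instead
# optimizes attacked activations against the target cross-entropy.
x_opt = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
```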