Scaling Laws for Adversarial Attacks on Language Model Activations and Tokens

Authors: Stanislav Fort

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type: Experimental. We empirically verify a scaling law in which the maximum number of target tokens predicted, t_max, depends linearly on the number of tokens a whose activations the attacker controls, t_max = κa. We ran adversarial attacks on model activations immediately after the embedding layer for a suite of models, a range of attack lengths a and target token lengths t, and multiple repetitions of each experimental setup (with different random context tokens S and target tokens T each time), obtaining an empirical probability of attack success p(a, t) for each setting. Figure 3 shows an example of the results of our experiments on EleutherAI/pythia-1.4b-v0.
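Given empirical (a, t_max) measurements, the slope κ of the reported scaling law can be estimated with a least-squares fit through the origin. A minimal sketch follows; the data points are hypothetical placeholders, not the paper's measurements.

```python
# Hypothetical (a, t_max) pairs standing in for the paper's measurements:
# a = number of attacked activation tokens, t_max = longest target
# sequence reproduced with high success probability.
data = [(1, 9), (2, 21), (4, 39), (8, 81), (16, 159)]

def fit_kappa(pairs):
    """Least-squares slope of the through-the-origin line t_max = kappa * a."""
    num = sum(a * t for a, t in pairs)
    den = sum(a * a for a, _ in pairs)
    return num / den

kappa = fit_kappa(data)
print(f"kappa ~ {kappa:.2f}")
```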
Researcher Affiliation: Academia. Stanislav Fort, Independent Researcher, Prague, Czech Republic. Given that the author is listed as an "Independent Researcher" and the paper is published at an academic conference (ICLR), this affiliation aligns more closely with academia than with corporate industry, even without a formal university affiliation.
Pseudocode: Yes. The algorithms are shown in Figure 8: Algorithm 1, computing the loss for an activation attack P towards a t-token target sequence, and Algorithm 2, a greedy, exhaustive token attack towards a t-token target sequence.
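A minimal sketch of the loss described for Algorithm 1, assuming a teacher-forced cross-entropy over the t target tokens. Here `model_logits_fn` is a hypothetical stand-in for a forward pass that consumes the attacker-controlled post-embedding activations; it is not the paper's implementation.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def activation_attack_loss(model_logits_fn, attack_activations, target_tokens):
    """Mean teacher-forced cross-entropy of the target tokens when the first
    a post-embedding activations are replaced by attack_activations."""
    # One logit vector per target position, conditioned on the attacked prefix.
    logits_per_pos = model_logits_fn(attack_activations, target_tokens)
    nll = 0.0
    for logits, tok in zip(logits_per_pos, target_tokens):
        nll -= math.log(softmax(logits)[tok])
    return nll / len(target_tokens)

# Toy "model": uniform logits over a 4-token vocabulary at every position.
uniform_model = lambda acts, targets: [[0.0, 0.0, 0.0, 0.0] for _ in targets]
loss = activation_attack_loss(uniform_model,
                              attack_activations=[[0.1, -0.2]],
                              target_tokens=[2, 0, 3])
```

With uniform logits over V = 4 tokens, every position contributes -log(1/4), so the mean loss is log(4).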
Open Source Code: No. The paper mentions using open-source models such as the "EleutherAI/pythia series of Large Language Models... from Hugging Face", microsoft/phi-1, and roneneldan/TinyStories, which are publicly available. However, no explicit statement or link is provided for the source code of the specific methodology developed and implemented in this paper.
Open Datasets: Yes. We have been using the EleutherAI/pythia series of Large Language Models (Biderman et al., 2023) based on the GPT-NeoX library (Andonian et al., 2021; Black et al., 2022) from Hugging Face. A second suite of models we used is microsoft/phi-1 (Li et al., 2023). Finally, we used a single checkpoint of roneneldan/TinyStories presented in Eldan & Li (2023).
Dataset Splits: No. The paper states: "We use random tokens sampled uniformly both for the context S as well as the targets T to ensure fairness." and "For each fixed (a, t), we repeat an experiment where we generate random context tokens S, and random target tokens T...". This describes generating fresh random data for each experiment run rather than fixed training/validation/test splits of a pre-existing dataset.
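A sketch of the uniform random sampling described above; the sequence lengths are illustrative, and the vocabulary size is taken from the paper's reported V ~ 50,000.

```python
import random

V = 50_000  # approximate vocabulary size reported in the paper

def sample_tokens(n, vocab_size=V, seed=None):
    """Draw n token ids uniformly at random, as done for the context S
    and the target T in each repetition of an (a, t) experiment."""
    rng = random.Random(seed)
    return [rng.randrange(vocab_size) for _ in range(n)]

context_S = sample_tokens(64, seed=0)  # fresh random context each run
target_T = sample_tokens(10, seed=1)   # fresh random target each run
```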
Hardware Specification: Yes. We ran our experiments on a single A100 GPU on Google Colab.
Software Dependencies: No. The paper mentions software components such as the EleutherAI/pythia series of models, the GPT-NeoX library, and the Adam optimizer, but it does not provide specific version numbers for these or other key software dependencies (e.g., Python, PyTorch).
Experiment Setup: Yes. For finding the adversarial attacks on activations, we used the Adam optimizer (Kingma & Ba, 2017) at a learning rate of 10^-1 for 300 optimization steps, unless explicitly stated otherwise. Our activations were all in the float16 format, and the model vocabulary sizes were all very close to V ~ 50,000.
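To illustrate the reported optimizer configuration (Adam, learning rate 10^-1, 300 steps), here is a minimal from-scratch Adam loop on a stand-in scalar loss. In the paper, the optimized variable is the block of attacked activations and the loss is the cross-entropy of the target tokens; the quadratic loss below is purely illustrative.

```python
import math

def adam_minimize(grad_fn, x0, lr=0.1, steps=300,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Plain Adam (Kingma & Ba, 2017) on a single scalar parameter."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g          # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Stand-in loss L(x) = (x - 3)^2, gradient 2 (x - 3); the paper instead
# optimizes attacked activations against the target cross-entropy.
x_opt = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
```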