Scaling Trends in Language Model Robustness
Authors: Nikolaus H. R. Howe, Ian R. McKenzie, Oskar John Hollinsworth, Michał Zając, Tom Tseng, Aaron David Tucker, Pierre-Luc Bacon, Adam Gleave
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we conduct the first publicly available large-scale empirical investigation into scaling trends for the adversarial robustness of language models, with a focus on classification tasks. |
| Researcher Affiliation | Collaboration | FAR.AI, Berkeley, California, USA; Mila Quebec AI Institute, Montreal, Quebec, Canada; Université de Montréal, Montreal, Quebec, Canada. |
| Pseudocode | Yes | Our adversarial training procedure is detailed in Algorithm 1. We describe the baseline Random Token algorithm in Algorithm 2. Algorithm 3 lays out the details of the interpolation scheme. |
| Open Source Code | Yes | Code for this project is available at https://github.com/AlignmentResearch/scaling-llm-robustness-paper. |
| Open Datasets | Yes | Pythia (Biderman et al., 2023) and Qwen2.5 (Qwen et al., 2025). Pythia ... pretrained on the publicly available Pile dataset (Gao et al., 2020)... Spam, whether an email is spam (Metsis et al., 2006), and IMDB, whether a movie review is positive (Maas et al., 2011). We adapt the Bai et al. (2022) dataset... For generation, we use data from the StrongREJECT task (Souly et al., 2024). |
| Dataset Splits | Yes | We finetune all classification models for three epochs on a task dataset of 20,000 examples... We train on a subset of 20,000 datapoints sampled with a fixed seed... We use a constant dataset size of 1,000 examples for each round of adversarial training. ... We sample s_adv = min(80% × 1000, n_adv) from the adversarial dataset, and the remaining s_clean = n_aug − s_adv from the clean data. We evaluate the models using a dataset size of 500 for both clean and attacked validation datasets. |
| Hardware Specification | No | The paper mentions "multi-GPU runs" and "managed the cluster nodes" but does not specify any particular GPU models, processor types, or other detailed hardware specifications. |
| Software Dependencies | No | The acknowledgements mention "Hugging Face Transformers (Wolf et al., 2019)" but do not provide specific version numbers for this library or any other software dependencies. |
| Experiment Setup | Yes | We finetune all classification models for three epochs on a task dataset of 20,000 examples, using a linear learning rate schedule that decays from 1e-5 to 0. Every adversarial training round, we add 200 new attacked examples optimized against the current model to a pool of attacked datapoints. We then sample from this pool, as well as from a clean training set, to construct a 1000-example adversarial training dataset for that round. For GCG, we use k_start = 8, k_finish = 64. For Random Token, we use k_start = 1024, k_finish = 2048. |
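The per-round dataset construction quoted in the Dataset Splits row (s_adv = min(80% × 1000, n_adv) attacked examples, the remainder clean) can be sketched as follows. This is an illustrative reconstruction, not the paper's code; the function name `build_round_dataset` and the use of Python's `random` module are assumptions.

```python
import random

def build_round_dataset(adv_pool, clean_pool, n_aug=1000, adv_frac=0.8, seed=0):
    """Sketch of one round's adversarial training dataset construction.

    Samples s_adv = min(adv_frac * n_aug, |adv_pool|) attacked examples
    and fills the remaining s_clean = n_aug - s_adv slots with clean data,
    as described in the paper's dataset-splits quote.
    """
    rng = random.Random(seed)
    s_adv = min(int(adv_frac * n_aug), len(adv_pool))
    s_clean = n_aug - s_adv
    batch = rng.sample(adv_pool, s_adv) + rng.sample(clean_pool, s_clean)
    rng.shuffle(batch)  # mix attacked and clean examples before training
    return batch
```

Early rounds have few attacked datapoints, so the min() caps s_adv at the pool size and the round is mostly clean data; once the pool exceeds 800 examples, each round is 80% attacked.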