Scaling Trends in Language Model Robustness
Authors: Nikolaus H. R. Howe, Ian R. McKenzie, Oskar John Hollinsworth, Michał Zając, Tom Tseng, Aaron David Tucker, Pierre-Luc Bacon, Adam Gleave
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we conduct the first publicly available large-scale empirical investigation into scaling trends for the adversarial robustness of language models, with a focus on classification tasks. |
| Researcher Affiliation | Collaboration | FAR.AI, Berkeley, California, USA; Mila Quebec AI Institute, Montreal, Quebec, Canada; Université de Montréal, Montreal, Quebec, Canada. |
| Pseudocode | Yes | Our adversarial training procedure is detailed in Algorithm 1. We describe the baseline Random Token algorithm in Algorithm 2. Algorithm 3 lays out the details of the interpolation scheme. |
| Open Source Code | Yes | Code for this project is available at https://github.com/AlignmentResearch/scaling-llm-robustness-paper. |
| Open Datasets | Yes | Pythia (Biderman et al., 2023) and Qwen2.5 (Qwen et al., 2025). Pythia ... pretrained on the publicly available Pile dataset (Gao et al., 2020)... Spam, whether an email is spam (Metsis et al., 2006), and IMDB, whether a movie review is positive (Maas et al., 2011). We adapt the Bai et al. (2022) dataset... For generation, we use data from the StrongREJECT task (Souly et al., 2024). |
| Dataset Splits | Yes | We finetune all classification models for three epochs on a task dataset of 20,000 examples... We train on a subset of 20,000 datapoints sampled with a fixed seed... We use a constant dataset size of 1,000 examples for each round of adversarial training. ... We sample s_adv = min(80% × 1000, n_adv) from the adversarial dataset, and the remaining s_clean = n_aug − s_adv from the clean data. We evaluate the models using a dataset size of 500 for both clean and attacked validation datasets. |
| Hardware Specification | No | The paper mentions "multi-GPU runs" and "managed the cluster nodes" but does not specify any particular GPU models, processor types, or other detailed hardware specifications. |
| Software Dependencies | No | The acknowledgements mention "Hugging Face Transformers (Wolf et al., 2019)" but do not provide specific version numbers for this library or any other software dependencies. |
| Experiment Setup | Yes | We finetune all classification models for three epochs on a task dataset of 20,000 examples, using a linear learning rate schedule that decays from 1e-5 to 0. Every adversarial training round, we add 200 new attacked examples optimized against the current model to a pool of attacked datapoints. We then sample from this pool, as well as from a clean training set, to construct a 1000-example adversarial training dataset for that round. For GCG, we use k_start = 8, k_finish = 64. For Random Token, we use k_start = 1024, k_finish = 2048. |
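The per-round dataset construction quoted in the Dataset Splits row (s_adv = min(80% × 1000, n_adv) attacked examples, the remainder clean) can be sketched as follows. This is an illustrative reconstruction, not the paper's code; the function name `build_round_dataset` and the use of Python's `random` module are assumptions.

```python
import random

def build_round_dataset(adv_pool, clean_pool, n_aug=1000, adv_frac=0.8, seed=0):
    """Sketch of one round's adversarial training dataset construction.

    Samples s_adv = min(adv_frac * n_aug, |adv_pool|) attacked examples
    and fills the remaining s_clean = n_aug - s_adv slots with clean data,
    as described in the paper's dataset-splits quote.
    """
    rng = random.Random(seed)
    s_adv = min(int(adv_frac * n_aug), len(adv_pool))
    s_clean = n_aug - s_adv
    batch = rng.sample(adv_pool, s_adv) + rng.sample(clean_pool, s_clean)
    rng.shuffle(batch)  # mix attacked and clean examples before training
    return batch
```

Early rounds have few attacked datapoints, so the min() caps s_adv at the pool size and the round is mostly clean data; once the pool exceeds 800 examples, each round is 80% attacked.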