AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Authors: Teun van der Weij, Felix Hofstätter, Oliver Jaffe, Samuel Brown, Francis Ward

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental In this paper we assess sandbagging capabilities in contemporary language models (LMs). We prompt frontier LMs, like GPT-4 and Claude 3 Opus, to selectively underperform on dangerous capability evaluations, while maintaining performance on general (harmless) capability evaluations. Moreover, we find that models can be fine-tuned, on a synthetic dataset, to hide specific capabilities unless given a password. This behaviour generalizes to high-quality, held-out benchmarks such as WMDP. In addition, we show that both frontier and smaller models can be prompted or password-locked to target specific scores on a capability evaluation.
Researcher Affiliation Academia Teun van der Weij MATS Felix Hofstätter MATS Oliver Jaffe Independent Samuel F. Brown Independent Francis Rhys Ward Imperial College London
Pseudocode No The paper describes methodologies primarily through textual descriptions and prompt examples (Figures 3, 6, 12, 13, 14, 15, 16, 19) rather than formal pseudocode or algorithm blocks. No explicit section or figure is labeled 'Pseudocode' or 'Algorithm'.
Open Source Code Yes See our code at https://github.com/Teunvd Weij/sandbagging
Open Datasets Yes We used Massive Multi-Task Language Understanding (MMLU) (Hendrycks et al., 2020) and Commonsense QA (CSQA) (Talmor et al., 2018) to evaluate general capabilities. We also used the Weapons of Mass Destruction Proxy (WMDP) benchmark; this serves as a proxy measurement of hazardous knowledge in biology, chemistry, and cybersecurity (Li et al., 2024).
Dataset Splits Yes The final subset contained a total of 3347 questions, and a 90-10 trainvalidation split.
Hardware Specification No The paper does not explicitly provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running its experiments. It only mentions general concepts like 'bfloat16 precision' and 'four-bit quantisation' related to model loading.
Software Dependencies No To fine-tune open-weight models, we used the Py Torch, Huggingface Transformers, and PEFT libraries.
Experiment Setup Yes Full fine-tuning details, including hyperparameters can be found in Appendix C. All training runs used the standard optimizer settings found in Table 3. For computational efficiency, we applied Lo RA (Hu et al., 2021) to all models, using the hyperparameters from Table 4.