reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Authors: Teun van der Weij, Felix Hofstätter, Oliver Jaffe, Samuel Brown, Francis Ward

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this paper we assess sandbagging capabilities in contemporary language models (LMs). We prompt frontier LMs, like GPT-4 and Claude 3 Opus, to selectively underperform on dangerous capability evaluations, while maintaining performance on general (harmless) capability evaluations. Moreover, we find that models can be fine-tuned, on a synthetic dataset, to hide specific capabilities unless given a password. This behaviour generalizes to high-quality, held-out benchmarks such as WMDP. In addition, we show that both frontier and smaller models can be prompted or password-locked to target specific scores on a capability evaluation.
Researcher Affiliation	Academia	Teun van der Weij MATS Felix Hofstätter MATS Oliver Jaffe Independent Samuel F. Brown Independent Francis Rhys Ward Imperial College London
Pseudocode	No	The paper describes methodologies primarily through textual descriptions and prompt examples (Figures 3, 6, 12, 13, 14, 15, 16, 19) rather than formal pseudocode or algorithm blocks. No explicit section or figure is labeled 'Pseudocode' or 'Algorithm'.
Open Source Code	Yes	See our code at https://github.com/Teunvd Weij/sandbagging
Open Datasets	Yes	We used Massive Multi-Task Language Understanding (MMLU) (Hendrycks et al., 2020) and Commonsense QA (CSQA) (Talmor et al., 2018) to evaluate general capabilities. We also used the Weapons of Mass Destruction Proxy (WMDP) benchmark; this serves as a proxy measurement of hazardous knowledge in biology, chemistry, and cybersecurity (Li et al., 2024).
Dataset Splits	Yes	The final subset contained a total of 3347 questions, and a 90-10 trainvalidation split.
Hardware Specification	No	The paper does not explicitly provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running its experiments. It only mentions general concepts like 'bfloat16 precision' and 'four-bit quantisation' related to model loading.
Software Dependencies	No	To fine-tune open-weight models, we used the Py Torch, Huggingface Transformers, and PEFT libraries.
Experiment Setup	Yes	Full fine-tuning details, including hyperparameters can be found in Appendix C. All training runs used the standard optimizer settings found in Table 3. For computational efficiency, we applied Lo RA (Hu et al., 2021) to all models, using the hyperparameters from Table 4.