Tamper-Resistant Safeguards for Open-Weight LLMs
Authors: Rishub Tamirisa, Bhrugu Bharathi, Long Phan, Andy Zhou, Alice Gatti, Tarun Suresh, Maxwell Lin, Justin Wang, Rowan Wang, Ron Arel, Andy Zou, Dawn Song, Bo Li, Dan Hendrycks, Mantas Mazeika
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In extensive evaluations and red teaming analyses, we find that our method greatly improves tamper-resistance while preserving benign capabilities. Our results demonstrate that progress on tamper-resistance is possible, opening up a promising new avenue to improve the safety and security of open-weight LLMs. In experiments, we demonstrate that our safeguards are far more robust to tampering attacks than prior methods. We stress-test our safeguards with extensive red teaming evaluations against 26 test-time adversaries, demonstrating resistance to fine-tuning attacks of hundreds of steps. We evaluate TAR in weaponization knowledge restriction and harmful request refusal settings, with results shown in Table 1 and Table 2 respectively. |
| Researcher Affiliation | Collaboration | 1Lapis Labs, 2University of Illinois Urbana-Champaign, 3University of California, San Diego, 4University of California, Berkeley, 5Carnegie Mellon University, 6Harvard University, 7University of Chicago, 8Gray Swan AI, 9Center for AI Safety |
| Pseudocode | Yes | Algorithm 1 TAR: Tampering Attack Resistance |
| Open Source Code | Yes | Our experiment code and models are available at https://github.com/rishub-tamirisa/tamper-resistance. |
| Open Datasets | Yes | Specifically, we consider the problem of restricting biosecurity, chemical security, and cybersecurity knowledge, and evaluate the resulting model on the Weapons of Mass Destruction Proxy (WMDP) benchmark (Li et al., 2024). WMDP contains 3,668 multiple-choice questions, spanning biosecurity, chemical security, and cybersecurity knowledge. [...] We define the forget set as the respective hazardous knowledge subject in WMDP, and retain set as the complement of the given subject in MMLU (Hendrycks et al., 2021), a multi-task question-answering benchmark spanning 57 tasks across a variety of knowledge domains. [...] Specifically, we use a static set of test cases from HarmBench, an automated red-teaming framework for measuring prompt jailbreak robustness in LLMs, to evaluate jailbreak ASR (Mazeika et al., 2024). [...] We use a synthetically labeled partition of the Pile (Gao et al., 2020) that was filtered for relevance to biology and the Camel AI Biology dataset (Li et al., 2023). [...] We scrape CTF writeups on CTFtime (CTFtime, 2024) that are numbered between 1 and 39181, collecting cybersecurity writeups written as recently as 2024. [...] For harmful request refusal training, we seek to make existing refusal safeguards in Llama3-8B-Instruct robust to tampering attacks. We sample train-time adversaries that perform 64-step SFT attacks using the Anthropic-HH-RLHF dataset (Bai et al., 2022), following the methodology in Appendix E.2. [...] Additionally, after the initial release of this paper, we identified a data contamination issue in which our instruction-tuning retain dataset, Magpie-Align (Xu et al., 2024), contained a significant amount of forget set content. |
| Dataset Splits | Yes | For each weaponization knowledge domain, we create 80-20 splits for adversary and held-out data of the corresponding forget sets, respectively. For Biosecurity, which uses multiple forget datasets, this involves creating 80-20 splits for each dataset, then combining the corresponding splits. The adversary data splits are used for sampled attacks from Atrain, whereas the held-out split is used for computing tamper-resistance losses. |
| Hardware Specification | Yes | We perform TAR training on Llama-3-8B-Instruct (Llama Team, AI @ Meta, 2024) with 8 NVIDIA 80GB A100 GPUs, leveraging distributed training via FSDP (Ren et al., 2021; Rajbhandari et al., 2020; Zhao et al., 2023). [...] This work used NVIDIA GPUs at NCSA Delta through allocations CIS230117 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by NSF Grants #2138259, #2138286, #2138307, #2137603, and #2138296. |
| Software Dependencies | No | The paper mentions using FSDP (Ren et al., 2021; Rajbhandari et al., 2020; Zhao et al., 2023), ZeRO Stage 3 from DeepSpeed (Rajbhandari et al., 2020), and various optimizers, including Schedule-Free AdamW (Defazio et al., 2024), AdamW (Kingma & Ba, 2017), Adadelta (Zeiler, 2012), and SGD with Nesterov momentum (Xie et al., 2023). However, it does not provide version numbers for any of these software components or libraries, which are necessary for reproducible software dependencies. |
| Experiment Setup | Yes | We use N = 750 outer loop steps, Schedule-Free AdamW (Defazio et al., 2024) with a learning rate of 2 × 10⁻⁵ as the outer loop tamper-resistance optimizer. For biosecurity and cybersecurity we set the tamper-resistance loss scale λTR to 4.0, and use λTR = 3.0 for chemical security. We use λretain = 1.0 in all settings. [...] For harmful request refusal training, [...] we use N = 100 outer loop steps, Schedule-Free AdamW (Defazio et al., 2024) with an LR of 6 × 10⁻⁵ as the outer loop tamper-resistance optimizer, and loss scales of λTR = 0.1, λretain = 1.0. [...] By default, our attacks use 500 fine-tuning steps. Full details for these adversaries are provided in Table 9. [...] For Biosecurity, [...] LRs are sampled from {2 × 10⁻⁵, 4 × 10⁻⁵}. [...] We fine-tune models for 2 epochs using a learning rate of 2 × 10⁻⁶ and a batch size of 32, using AdamW Schedule-Free (Defazio et al., 2024). [...] In cases where the adversary used parameter-efficient fine-tuning (PEFT) via LoRA adapters (Hu et al., 2021), we used a LoRA config with an attention dimension, or rank, of 16, a LoRA alpha value of 32, a LoRA dropout of 0.05, on target linear modules: {`up_proj`, `down_proj`, `gate_proj`, `q_proj`, `k_proj`, `v_proj`, `o_proj`}. |
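The Dataset Splits row describes creating 80-20 adversary/held-out partitions per forget dataset, then combining the corresponding partitions for domains (such as Biosecurity) that use multiple forget datasets. A minimal sketch of that procedure, with hypothetical function names and a standard shuffle-then-cut split (the paper does not specify the exact splitting code):

```python
import random

def adversary_heldout_split(examples, adversary_frac=0.8, seed=0):
    """Shuffle one forget dataset and split it into 80% adversary / 20% held-out."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * adversary_frac)
    return shuffled[:cut], shuffled[cut:]

def combined_splits(datasets, adversary_frac=0.8, seed=0):
    """For domains with multiple forget datasets, split each dataset separately,
    then concatenate the corresponding adversary and held-out partitions."""
    adversary, held_out = [], []
    for ds in datasets:
        adv, held = adversary_heldout_split(ds, adversary_frac, seed)
        adversary.extend(adv)
        held_out.extend(held)
    return adversary, held_out
```

Per the report, the adversary partition feeds sampled train-time attacks from Atrain, while the held-out partition is reserved for computing tamper-resistance losses.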
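The Experiment Setup row specifies the LoRA hyperparameters used by the PEFT tampering adversaries. A sketch of that configuration as a plain dictionary; the field names mirror Hugging Face PEFT's `LoraConfig` arguments, but this is an illustrative stand-in rather than the authors' actual attack code:

```python
# LoRA attack hyperparameters as reported in the paper's appendix.
# The variable name and dict layout are hypothetical.
lora_attack_config = {
    "r": 16,              # LoRA attention dimension (rank)
    "lora_alpha": 32,     # LoRA scaling factor
    "lora_dropout": 0.05,
    "target_modules": [   # all linear projections in each transformer block
        "up_proj", "down_proj", "gate_proj",
        "q_proj", "k_proj", "v_proj", "o_proj",
    ],
}
```

With PEFT installed, these values would be passed directly, e.g. `LoraConfig(**lora_attack_config)`, before wrapping the model for the fine-tuning attack.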