Adversaries Can Misuse Combinations of Safe Models
Authors: Erik Jones, Anca Dragan, Jacob Steinhardt
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we empirically show that individually testing models for such misuse is inadequate; adversaries can misuse combinations of models even when each individual model is safe. ... Using these decompositions, we empirically show that adversaries can create vulnerable code, explicit images, python scripts for hacking, and manipulative tweets at much higher rates with combinations of models than either individual model. |
| Researcher Affiliation | Academia | UC Berkeley. Correspondence to: Erik Jones <EMAIL>. |
| Pseudocode | No | The paper describes methods and processes in narrative text and uses diagrams (Figure 1) to illustrate workflows, but it does not contain any clearly labeled pseudocode blocks or algorithms with structured steps. |
| Open Source Code | No | The paper does not provide an explicit statement or link to its own source code for the methodology described. It mentions using 'Hugging Face versions of the weak models on our own compute' and accessing 'frontier systems through APIs', referring to third-party tools and models. |
| Open Datasets | Yes | Appendix A.3: "We include the resulting dataset on GitHub." (referring to prompts for explicit image generation). Appendix A.4: "This created a dataset of 130 pairs, which is included in the GitHub." (referring to the malicious python script dataset). Appendix A.5: "The historical Twitter data comes from accounts associated with the Internet Research Agency (a Russian disinformation operation) released by (Linvill & Warren, 2020) and accessed on FiveThirtyEight: https://github.com/fivethirtyeight/russian-troll-tweets" |
| Dataset Splits | No | The paper describes the generation of custom datasets for evaluation, stating numbers of questions or prompts (e.g., "100 programming questions", "130 such examples", "100 prompts per dataset"). It also mentions testing on "100 randomly chosen user-politician combinations, and 100 randomly chosen pair-politician combinations". However, it does not specify explicit training, validation, or test dataset splits in the conventional sense for training a new model, as the work primarily evaluates pre-trained models. |
| Hardware Specification | Yes | We run all of the weak language models Llama 2 7B-chat, 13B-chat, 70B-chat, Mistral 7B instruct, and Mixtral 8x7B instruct on two A100s using vLLM to speed up inference (Kwon et al., 2023). |
| Software Dependencies | Yes | We access GPT-4 and DALL-E 3 through OpenAI's API. For GPT-4, we use the gpt-4-0125-preview version of GPT-4-turbo. ... We access all three versions of Claude 3 through Anthropic's API. We use the claude-3-opus-20240229 version of Claude 3 Opus, the claude-3-sonnet-20240229 version of Claude 3 Sonnet, and the claude-3-haiku-20240307 version of Claude 3 Haiku. ... We run all of the weak language models ... on two A100s using vLLM to speed up inference (Kwon et al., 2023). We use the -chat versions of Llama 2 ... the Mistral-7B-Instruct-v0.2 version of Mistral ... and the Mixtral-8x7B-Instruct-v0.1 version of Mixtral ... We use the version of Stable Diffusion v1.5 on Hugging Face ... and the InstructPix2Pix model on Hugging Face released by Tim Brooks; this model was fine-tuned from Stable Diffusion v1.5. |
| Experiment Setup | Yes | For all language models, we sample at temperature 0.01 for reproducibility, and adaptively set the maximum number of tokens required for the task. ... For DALL-E 3, we generate images at standard quality at 1024 x 1024 resolution, while otherwise using defaults. ... InstructPix2Pix requires setting the Image CFG and Text CFG hyperparameters; for all methods, we set the Text CFG to 7.7 and the Image CFG to 1.3. We evaluated both (7.7, 1.3) and (7.5, 1.5) across all methods and baselines on the explicit image generation task, and (7.7, 1.3) strictly dominated. |
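The version pins and decoding settings quoted in the rows above can be collected into a small configuration sketch. This is an illustrative summary only, not code released with the paper: the dictionary and constant names are assumptions, while the values themselves are the ones the paper reports.

```python
# Hypothetical configuration summary of the reported experimental setup.
# All values are quoted from the paper; the names are illustrative.

# Frontier models accessed through provider APIs, pinned to exact versions.
API_MODEL_VERSIONS = {
    "gpt-4-turbo": "gpt-4-0125-preview",            # OpenAI API
    "claude-3-opus": "claude-3-opus-20240229",      # Anthropic API
    "claude-3-sonnet": "claude-3-sonnet-20240229",
    "claude-3-haiku": "claude-3-haiku-20240307",
}

# Weak open-weight models, run locally on two A100s with vLLM.
WEAK_MODELS = [
    "Llama-2-7b-chat", "Llama-2-13b-chat", "Llama-2-70b-chat",
    "Mistral-7B-Instruct-v0.2", "Mixtral-8x7B-Instruct-v0.1",
]

# Near-greedy decoding, chosen for reproducibility.
LM_SAMPLING = {"temperature": 0.01}

# DALL-E 3 image generation settings (otherwise provider defaults).
DALLE3_SETTINGS = {"quality": "standard", "size": "1024x1024"}

# InstructPix2Pix guidance scales: (7.7, 1.3) strictly dominated the
# alternative (7.5, 1.5) on the explicit image generation task.
PIX2PIX_CFG = {"text_cfg": 7.7, "image_cfg": 1.3}
```

Pinning dated model identifiers, as the paper does, is what makes API-based results even approximately reproducible, since unversioned aliases like "gpt-4" silently change over time.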