Adversaries Can Misuse Combinations of Safe Models
Authors: Erik Jones, Anca Dragan, Jacob Steinhardt
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we empirically show that individually testing models for such misuse is inadequate; adversaries can misuse combinations of models even when each individual model is safe. ... Using these decompositions, we empirically show that adversaries can create vulnerable code, explicit images, python scripts for hacking, and manipulative tweets at much higher rates with combinations of models than either individual model. |
| Researcher Affiliation | Academia | UC Berkeley. Correspondence to: Erik Jones <EMAIL>. |
| Pseudocode | No | The paper describes methods and processes in narrative text and uses diagrams (Figure 1) to illustrate workflows, but it does not contain any clearly labeled pseudocode blocks or algorithms with structured steps. |
| Open Source Code | No | The paper does not provide an explicit statement or link to its own source code for the methodology described. It mentions using 'Hugging Face versions of the weak models on our own compute' and accessing 'frontier systems through APIs', referring to third-party tools and models. |
| Open Datasets | Yes | Appendix A.3: "We include the resulting dataset on GitHub." (referring to prompts for explicit image generation). Appendix A.4: "This created a dataset of 130 pairs, which is included in the GitHub." (referring to the malicious python script dataset). Appendix A.5: "The historical Twitter data comes from accounts associated with the Internet Research Agency (a Russian disinformation operation) released by (Linvill & Warren, 2020) and accessed on FiveThirtyEight: https://github.com/fivethirtyeight/russian-troll-tweets" |
| Dataset Splits | No | The paper describes the generation of custom datasets for evaluation, stating numbers of questions or prompts (e.g., "100 programming questions", "130 such examples", "100 prompts per dataset"). It also mentions testing on "100 randomly chosen user-politician combinations, and 100 randomly chosen pair-politician combinations". However, it does not specify explicit training, validation, or test dataset splits in the conventional sense for training a new model, as the work primarily evaluates pre-trained models. |
| Hardware Specification | Yes | We run all of the weak language models Llama 2 7B-chat, 13B-chat, 70B-chat, Mistral 7B instruct, and Mixtral 8x7B instruct on two A100s using vLLM to speed up inference (Kwon et al., 2023). |
| Software Dependencies | Yes | We access GPT-4 and DALL-E 3 through OpenAI's API. For GPT-4, we use the gpt-4-0125-preview version of GPT-4-turbo. ... We access all three versions of Claude 3 through Anthropic's API. We use the claude-3-opus-20240229 version of Claude 3 Opus, the claude-3-sonnet-20240229 version of Claude 3 Sonnet, and the claude-3-haiku-20240307 version of Claude 3 Haiku. ... We run all of the weak language models ... on two A100s using vLLM to speed up inference (Kwon et al., 2023). We use the -chat versions of Llama 2 ... the Mistral-7B-Instruct-v0.2 version of Mistral ... and the Mixtral-8x7B-Instruct-v0.1 version of Mixtral ... We use the version of Stable Diffusion v1.5 on Hugging Face ... and the InstructPix2Pix model on Hugging Face released by Tim Brooks; this model was fine-tuned from Stable Diffusion v1.5. |
| Experiment Setup | Yes | For all language models, we sample at temperature 0.01 for reproducibility, and adaptively set the maximum number of tokens required for the task. ... For DALL-E 3, we generate images at standard quality at 1024 x 1024 resolution, while otherwise using defaults. ... InstructPix2Pix requires setting the Image CFG and Text CFG hyperparameters; for all methods, we set the Text CFG to 7.7 and the Image CFG to 1.3. We evaluated both (7.7, 1.3) and (7.5, 1.5) across all methods and baselines on the explicit image generation task, and (7.7, 1.3) strictly dominated. |
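The version pins and decoding settings quoted in the rows above can be collected into a small configuration sketch. This is an illustrative summary only, not code released with the paper: the dictionary and constant names are assumptions, while the values themselves are the ones the paper reports.

```python
# Hypothetical configuration summary of the reported experimental setup.
# All values are quoted from the paper; the names are illustrative.

# Frontier models accessed through provider APIs, pinned to exact versions.
API_MODEL_VERSIONS = {
    "gpt-4-turbo": "gpt-4-0125-preview",            # OpenAI API
    "claude-3-opus": "claude-3-opus-20240229",      # Anthropic API
    "claude-3-sonnet": "claude-3-sonnet-20240229",
    "claude-3-haiku": "claude-3-haiku-20240307",
}

# Weak open-weight models, run locally on two A100s with vLLM.
WEAK_MODELS = [
    "Llama-2-7b-chat", "Llama-2-13b-chat", "Llama-2-70b-chat",
    "Mistral-7B-Instruct-v0.2", "Mixtral-8x7B-Instruct-v0.1",
]

# Near-greedy decoding, chosen for reproducibility.
LM_SAMPLING = {"temperature": 0.01}

# DALL-E 3 image generation settings (otherwise provider defaults).
DALLE3_SETTINGS = {"quality": "standard", "size": "1024x1024"}

# InstructPix2Pix guidance scales: (7.7, 1.3) strictly dominated the
# alternative (7.5, 1.5) on the explicit image generation task.
PIX2PIX_CFG = {"text_cfg": 7.7, "image_cfg": 1.3}
```

Pinning dated model identifiers, as the paper does, is what makes API-based results even approximately reproducible, since unversioned aliases like "gpt-4" silently change over time.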