SEAS: Self-Evolving Adversarial Safety Optimization for Large Language Models
Authors: Muxi Diao, Rumei Li, Shiyang Liu, Guogang Liao, Jingang Wang, Xunliang Cai, Weiran Xu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our contributions include a novel adversarial framework, a comprehensive safety dataset, and empirical evidence demonstrating the effectiveness of SEAS. |
| Researcher Affiliation | Collaboration | Muxi Diao¹, Rumei Li², Shiyang Liu², Guogang Liao², Jingang Wang², Xunliang Cai², Weiran Xu¹* — ¹Beijing University of Posts and Telecommunications, Beijing, China; ²Meituan, Beijing, China |
| Pseudocode | No | The paper describes the SEAS pipeline through a textual explanation and a diagram (Figure 1), and provides mathematical equations for optimization loss, but it does not include a distinct pseudocode block or algorithm section with structured steps. |
| Open Source Code | Yes | Project Homepage https://SEAS-LLM.github.io/ We also open-source the training code, enabling researchers to replicate and validate our findings. |
| Open Datasets | Yes | To address this gap, we have developed a SEAS dataset, which features 14 types that cover two risk dimensions... This dataset contains 18K entries... The SEAS dataset was collected through crowdsourcing platforms, manually rewritten and labeled, and augmented with some open-source safety data (Liu et al. 2023; Tedeschi et al. 2024; Bhardwaj and Poria 2023). ... To construct the Target model, we selected three high-quality open-source general instruction-following datasets as seed datasets: ShareGPT (Chiang et al. 2023), Dolly (Conover et al. 2023) and LIMA (Zhou et al. 2023a). ... General data came from two high-quality paired open-source datasets, namely OpenOrca (Lian et al. 2023) and ORPO-DPO-Mix (Labonne 2024). |
| Dataset Splits | Yes | This dataset contains 18K entries, divided into a training set with 16K entries and an In-Domain test set with 2K entries. ... Overall, we used approximately 101K cleaned samples from these datasets. ... At the beginning of each Attack Stage, we constructed a dataset of 5K seed prompts. ... Each Attack Stage ultimately resulted in a dataset containing 125K entries. ... we collect approximately 4.8K data pairs per round for iterative optimization of Red Team model. ... we selected 2K pairs of safe data and randomly mixed them with general data for training. ... we mixed 7K pairs of general data in iteration 1, and 14K pairs of data in both iterations 2 and 3. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models (e.g., NVIDIA A100), CPU types, or memory used for running the experiments. It mentions using LLMs like Llama-3-8B and GPT-4, but not the underlying hardware for their own experimental setup. |
| Software Dependencies | No | The paper mentions using "Meta Llama Guard 2 (Meta 2024)" as a safety classifier, but it does not specify any general software dependencies or programming language versions (e.g., Python, PyTorch, TensorFlow) with version numbers. |
| Experiment Setup | Yes | We expect the Red Team model to generate complex and diverse adversarial prompts. To achieve this, we adopted an initialization scheme based on random sample contexts. ... we adopted nucleus sampling (Holtzman et al. 2019) and carried out multiple samplings to generate n prompts. ... Here, we use Direct Preference Optimization (DPO) loss (Rafailov et al. 2024). ... At the beginning of each Attack Stage, we constructed a dataset of 5K seed prompts. Seed prompts were created by randomly selecting 3 (k = 3) prompts of the same type from the training set. During the generation process of both Red Team and Target models, we employed settings of T = 0.8 and p = 0.9, and sampled each model 5 times (n = m = 5). When conducting safety evaluation with the Safe Classifier, we utilized a greedy strategy (Sutskever, Vinyals, and Le 2014). ... To balance the generality and security of the Target model, we selected 2K pairs of safe data and randomly mixed them with general data for training. ... In detail, we mixed 7K pairs of general data in iteration 1, and 14K pairs of data in both iterations 2 and 3. |
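The Experiment Setup row cites DPO loss (Rafailov et al. 2024) for iterative optimization of the Red Team and Target models. As context for readers checking reproducibility, here is a minimal per-pair sketch of the standard DPO objective; the function and variable names are ours, not the paper's, and β is a hypothetical default (the paper does not report its value in the quoted text).

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    logp_w / logp_l    : summed log-probability of the chosen (w) and
                         rejected (l) response under the policy model.
    ref_logp_w / ref_logp_l : the same quantities under the frozen
                         reference model.
    beta               : strength of the KL-style regularization
                         (assumed value; not reported in the paper).
    """
    # Difference of policy-vs-reference log-ratios for the two responses.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Negative log-sigmoid of the margin: loss shrinks as the policy
    # prefers the chosen response more strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With identical policy and reference log-probabilities the margin is zero and the loss is log 2, the standard DPO starting point; increasing the policy's preference for the chosen response drives the loss toward zero.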
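The setup row also specifies nucleus sampling with T = 0.8 and p = 0.9. For readers unfamiliar with these knobs, the following is a self-contained sketch of temperature-scaled softmax plus top-p selection over a toy distribution; it is an illustration of the general technique, not the paper's generation code, and the helper names are ours.

```python
import math
import random

def softmax_t(logits, T=0.8):
    """Softmax with temperature T; lower T sharpens the distribution."""
    m = max(logits)
    exps = [math.exp((x - m) / T) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def nucleus_sample(probs, p=0.9, rng=random):
    """Top-p (nucleus) sampling: keep the smallest set of tokens whose
    cumulative probability reaches p, renormalize, and draw one token."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, total = [], 0.0
    for i in order:
        nucleus.append(i)
        total += probs[i]
        if total >= p:
            break
    # Draw proportionally to probability mass inside the nucleus.
    r = rng.random() * total
    acc = 0.0
    for i in nucleus:
        acc += probs[i]
        if r <= acc:
            return i
    return nucleus[-1]
```

Sampling each model n = m = 5 times, as the paper describes, would amount to repeating `nucleus_sample` over the model's per-step distributions five times per prompt.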