SEAS: Self-Evolving Adversarial Safety Optimization for Large Language Models
Authors: Muxi Diao, Rumei Li, Shiyang Liu, Guogang Liao, Jingang Wang, Xunliang Cai, Weiran Xu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our contributions include a novel adversarial framework, a comprehensive safety dataset, and empirical evidence demonstrating the effectiveness of SEAS. |
| Researcher Affiliation | Collaboration | Muxi Diao¹, Rumei Li², Shiyang Liu², Guogang Liao², Jingang Wang², Xunliang Cai², Weiran Xu¹* — ¹Beijing University of Posts and Telecommunications, Beijing, China; ²Meituan, Beijing, China |
| Pseudocode | No | The paper describes the SEAS pipeline through a textual explanation and a diagram (Figure 1), and provides mathematical equations for optimization loss, but it does not include a distinct pseudocode block or algorithm section with structured steps. |
| Open Source Code | Yes | Project Homepage https://SEAS-LLM.github.io/ We also open-source the training code, enabling researchers to replicate and validate our findings. |
| Open Datasets | Yes | To address this gap, we have developed a SEAS dataset, which features 14 types that cover two risk dimensions... This dataset contains 18K entries... The SEAS dataset was collected through crowdsourcing platforms, manually rewritten and labeled, and augmented with some open-source safety data (Liu et al. 2023; Tedeschi et al. 2024; Bhardwaj and Poria 2023). ... To construct the Target model, we selected three high-quality open-source general instruction-following datasets as seed datasets: ShareGPT (Chiang et al. 2023), Dolly (Conover et al. 2023) and LIMA (Zhou et al. 2023a). ... General data came from two high-quality paired open-source datasets, namely OpenOrca (Lian et al. 2023) and ORPO-DPO-Mix (Labonne 2024). |
| Dataset Splits | Yes | This dataset contains 18K entries, divided into a training set with 16K entries and an In-Domain test set with 2K entries. ... Overall, we used approximately 101K cleaned samples from these datasets. ... At the beginning of each Attack Stage, we constructed a dataset of 5K seed prompts. ... Each Attack Stage ultimately resulted in a dataset containing 125K entries. ... we collect approximately 4.8K data pairs per round for iterative optimization of Red Team model. ... we selected 2K pairs of safe data and randomly mixed them with general data for training. ... we mixed 7K pairs of general data in iteration 1, and 14K pairs of data in both iterations 2 and 3. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models (e.g., NVIDIA A100), CPU types, or memory used for running the experiments. It mentions using LLMs like Llama-3-8B and GPT-4, but not the underlying hardware for their own experimental setup. |
| Software Dependencies | No | The paper mentions using "Meta Llama Guard 2 (Meta 2024)" as a safety classifier, but it does not specify any general software dependencies or programming language versions (e.g., Python, PyTorch, TensorFlow) with version numbers. |
| Experiment Setup | Yes | We expect the Red Team model to generate complex and diverse adversarial prompts. To achieve this, we adopted an initialization scheme based on random sample contexts. ... we adopted nucleus sampling (Holtzman et al. 2019) and carried out multiple samplings to generate n prompts. ... Here, we use Direct Preference Optimization (DPO) loss (Rafailov et al. 2024). ... At the beginning of each Attack Stage, we constructed a dataset of 5K seed prompts. Seed prompts were created by randomly selecting 3 (k = 3) prompts of the same type from the training set. During the generation process of both Red Team and Target models, we employed settings of T = 0.8 and p = 0.9, and sampled each model 5 times (n = m = 5). When conducting safety evaluation with the Safe Classifier, we utilized a greedy strategy (Sutskever, Vinyals, and Le 2014). ... To balance the generality and security of the Target model, we selected 2K pairs of safe data and randomly mixed them with general data for training. ... In detail, we mixed 7K pairs of general data in iteration 1, and 14K pairs of data in both iterations 2 and 3. |
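The Experiment Setup row cites DPO loss (Rafailov et al. 2024) for iterative optimization of the Red Team and Target models. As context for readers checking reproducibility, here is a minimal per-pair sketch of the standard DPO objective; the function and variable names are ours, not the paper's, and β is a hypothetical default (the paper does not report its value in the quoted text).

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    logp_w / logp_l    : summed log-probability of the chosen (w) and
                         rejected (l) response under the policy model.
    ref_logp_w / ref_logp_l : the same quantities under the frozen
                         reference model.
    beta               : strength of the KL-style regularization
                         (assumed value; not reported in the paper).
    """
    # Difference of policy-vs-reference log-ratios for the two responses.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Negative log-sigmoid of the margin: loss shrinks as the policy
    # prefers the chosen response more strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With identical policy and reference log-probabilities the margin is zero and the loss is log 2, the standard DPO starting point; increasing the policy's preference for the chosen response drives the loss toward zero.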
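The setup row also specifies nucleus sampling with T = 0.8 and p = 0.9. For readers unfamiliar with these knobs, the following is a self-contained sketch of temperature-scaled softmax plus top-p selection over a toy distribution; it is an illustration of the general technique, not the paper's generation code, and the helper names are ours.

```python
import math
import random

def softmax_t(logits, T=0.8):
    """Softmax with temperature T; lower T sharpens the distribution."""
    m = max(logits)
    exps = [math.exp((x - m) / T) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def nucleus_sample(probs, p=0.9, rng=random):
    """Top-p (nucleus) sampling: keep the smallest set of tokens whose
    cumulative probability reaches p, renormalize, and draw one token."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, total = [], 0.0
    for i in order:
        nucleus.append(i)
        total += probs[i]
        if total >= p:
            break
    # Draw proportionally to probability mass inside the nucleus.
    r = rng.random() * total
    acc = 0.0
    for i in nucleus:
        acc += probs[i]
        if r <= acc:
            return i
    return nucleus[-1]
```

Sampling each model n = m = 5 times, as the paper describes, would amount to repeating `nucleus_sample` over the model's per-step distributions five times per prompt.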