DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints

Authors: Andrew Zhao, Quentin Xu, Matthieu Lin, Shenzhi Wang, Yong-Jin Liu, Zilong Zheng, Gao Huang

AAAI 2025

Reproducibility Variable | Result | LLM Response
--- | --- | ---
Research Type | Experimental | "Our experiments demonstrate DiveR-CT's marked superiority over baselines by 1) generating data that perform better in various diversity metrics across different attack success rate levels, 2) better enhancing resiliency in blue team models through safety tuning based on collected data, 3) allowing dynamic control of objective weights for reliable and controllable attack success rates, and 4) reducing susceptibility to reward overoptimization."
Researcher Affiliation | Collaboration | Andrew Zhao1*, Quentin Xu1, Matthieu Lin1, Shenzhi Wang1, Yong-Jin Liu1, Zilong Zheng2, Gao Huang1 (1 Tsinghua University; 2 Beijing Institute for General Artificial Intelligence, BIGAI)
Pseudocode | No | The paper describes the methodology and objectives using mathematical formulations and descriptive text, but it does not include any clearly labeled pseudocode or algorithm blocks with structured, step-by-step procedures.
Open Source Code | Yes | Project Page: https://andrewzh112.github.io/diver-ct
Open Datasets | Yes | "We then evaluate the resulting models on the Open LLM Leaderboard benchmarks (HellaSwag, ARC-Challenge, TruthfulQA, and Winogrande (Zellers et al. 2019; Clark et al. 2018; Lin, Hilton, and Evans 2022; Sakaguchi et al. 2021)) and red teaming benchmarks: AART, SAP, and AdvBench (Radharapu et al. 2023; Deng et al. 2023; Zou et al. 2023)"
Dataset Splits | No | The paper mentions combining datasets with a ratio ("We maintain a ratio of 2:1 for the alpaca and toxic refusal data") and using established benchmarks, but it does not provide specific train/test/validation splits (percentages or counts) for its own experimental setup of red teaming data generation or fine-tuning. It implicitly relies on standard splits for the cited benchmarks but does not specify its own data partitioning methodology.
Hardware Specification | No | The paper does not explicitly mention specific hardware components such as GPU models (e.g., NVIDIA A100), CPU models (e.g., Intel Xeon), or cloud computing instance types used for running the experiments.
Software Dependencies | No | The paper mentions various models and frameworks used (e.g., PPO, gpt-4-turbo, vicgalle/gpt2-alpaca-gpt4, Meta-Llama-Guard-2-8B), but it does not specify version numbers for these or for other ancillary software components such as programming languages or libraries.
Experiment Setup | Yes | "We adopted a similar experimental setup as CRT (Hong et al. 2024), with minimal changes to maintain consistency. We used identical hyperparameters used by CRT, enabling fair comparisons. ... We set τ = 0.5 for all experiments. ... We present the main results using three different thresholds: dsafe ∈ {0.5, 0.7, 0.9} in Section 5.1. ... We empirically found that βsafe = 0.4 matched the ASR of dsafe = 0.7 and βsafe = 0.3 matched dsafe = 0.5. ... We maintain a ratio of 2:1 for the alpaca and toxic refusal data."
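The hyperparameters quoted in the Experiment Setup row can be gathered into a small configuration sketch for anyone attempting a reproduction. Only the numeric values (τ = 0.5, the dsafe thresholds, the empirically matched βsafe values, and the 2:1 data ratio) come from the paper; the variable names and structure below are hypothetical.

```python
# Hypothetical reproduction config; names/structure are illustrative,
# values are those reported in the paper's experiment setup.
DIVER_CT_SETUP = {
    "tau": 0.5,                            # τ, fixed for all experiments
    "d_safe_thresholds": [0.5, 0.7, 0.9],  # thresholds used for main results
    # βsafe values the authors found to match the ASR of a given d_safe
    "beta_safe_matching": {0.7: 0.4, 0.5: 0.3},
    "alpaca_to_toxic_refusal_ratio": (2, 1),  # 2:1 data mixing ratio
}

def beta_for_dsafe(d_safe: float) -> float:
    """Look up the empirically ASR-matched beta_safe for a d_safe threshold."""
    return DIVER_CT_SETUP["beta_safe_matching"][d_safe]
```

Note that the paper reports no matched βsafe for dsafe = 0.9, so the lookup above covers only the two reported pairs.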