DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints
Authors: Andrew Zhao, Quentin Xu, Matthieu Lin, Shenzhi Wang, Yong-Jin Liu, Zilong Zheng, Gao Huang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate DiveR-CT's marked superiority over baselines by 1) generating data that perform better in various diversity metrics across different attack success rate levels, 2) better enhancing resiliency in blue team models through safety tuning based on collected data, 3) allowing dynamic control of objective weights for reliable and controllable attack success rates, and 4) reducing susceptibility to reward overoptimization. |
| Researcher Affiliation | Collaboration | Andrew Zhao¹*, Quentin Xu¹, Matthieu Lin¹, Shenzhi Wang¹, Yong-Jin Liu¹, Zilong Zheng², Gao Huang¹ — ¹Tsinghua University, ²Beijing Institute for General Artificial Intelligence (BIGAI) |
| Pseudocode | No | The paper describes the methodology and objectives using mathematical formulations and descriptive text, but it does not include any clearly labeled pseudocode or algorithm blocks with structured, step-by-step procedures. |
| Open Source Code | Yes | Project Page https://andrewzh112.github.io/diver-ct |
| Open Datasets | Yes | We then evaluate the resulting models on the Open LLM Leaderboard benchmarks (HellaSwag, ARC-Challenge, TruthfulQA, and Winogrande (Zellers et al. 2019; Clark et al. 2018; Lin, Hilton, and Evans 2022; Sakaguchi et al. 2021)) and red teaming benchmarks: AART, SAP, and AdvBench (Radharapu et al. 2023; Deng et al. 2023; Zou et al. 2023) |
| Dataset Splits | No | The paper mentions combining datasets with a ratio ('We maintain a ratio of 2:1 for the alpaca and toxic refusal data') and using established benchmarks, but it does not provide specific train/test/validation splits (percentages or counts) for its own experimental setup of red teaming data generation or fine-tuning. It implicitly relies on standard splits for cited benchmarks but doesn't specify its own data partitioning methodology. |
| Hardware Specification | No | The paper does not explicitly mention specific hardware components such as GPU models (e.g., NVIDIA A100), CPU models (e.g., Intel Xeon), or cloud computing instance types used for running the experiments. |
| Software Dependencies | No | The paper mentions various models and frameworks used (e.g., PPO, gpt-4-turbo, vicgalle/gpt2-alpaca-gpt4, Meta-Llama-Guard-2-8B), but it does not specify version numbers for these or other ancillary software components like programming languages or libraries. |
| Experiment Setup | Yes | We adopted a similar experimental setup as CRT (Hong et al. 2024), with minimal changes to maintain consistency. We used identical hyperparameters used by CRT, enabling fair comparisons. ... We set τ = 0.5 for all experiments. ... We present the main results using three different thresholds: d_safe ∈ {0.5, 0.7, 0.9} in Section 5.1. ... We empirically found that β_safe = 0.4 matched the ASR of d_safe = 0.7 and β_safe = 0.3 matched d_safe = 0.5. ... We maintain a ratio of 2:1 for the alpaca and toxic refusal data. |
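The reported setup values (τ, the d_safe thresholds, the matched β_safe weights, and the 2:1 alpaca-to-toxic-refusal data ratio) can be collected in a minimal sketch. All names here (`TAU`, `SAFETY_THRESHOLDS`, `mix_safety_tuning_data`, etc.) are illustrative assumptions, not the authors' released code; only the numeric values come from the paper.

```python
# Hypothetical summary of the experiment configuration quoted above.
TAU = 0.5                             # τ, fixed for all experiments
SAFETY_THRESHOLDS = [0.5, 0.7, 0.9]   # d_safe values used for the main results
BETA_SAFE = {0.7: 0.4, 0.5: 0.3}      # β_safe empirically matching each d_safe's ASR

def mix_safety_tuning_data(alpaca, toxic_refusal, ratio=(2, 1)):
    """Interleave alpaca and toxic-refusal examples at the reported 2:1 ratio."""
    n_a, n_t = ratio
    mixed, ai, ti = [], 0, 0
    while ai < len(alpaca) or ti < len(toxic_refusal):
        for _ in range(n_a):
            if ai < len(alpaca):
                mixed.append(alpaca[ai]); ai += 1
        for _ in range(n_t):
            if ti < len(toxic_refusal):
                mixed.append(toxic_refusal[ti]); ti += 1
    return mixed
```

For example, four alpaca examples and two toxic-refusal examples interleave as two alpaca items followed by one refusal item, repeated, preserving the 2:1 proportion overall.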