DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints

Authors: Andrew Zhao, Quentin Xu, Matthieu Lin, Shenzhi Wang, Yong-Jin Liu, Zilong Zheng, Gao Huang

AAAI 2025

Reproducibility Variable | Result | LLM Response
--- | --- | ---
Research Type | Experimental | "Our experiments demonstrate DiveR-CT's marked superiority over baselines by 1) generating data that perform better in various diversity metrics across different attack success rate levels, 2) better enhancing resiliency in blue team models through safety tuning based on collected data, 3) allowing dynamic control of objective weights for reliable and controllable attack success rates, and 4) reducing susceptibility to reward overoptimization."
Researcher Affiliation | Collaboration | Andrew Zhao1*, Quentin Xu1, Matthieu Lin1, Shenzhi Wang1, Yong-Jin Liu1, Zilong Zheng2, Gao Huang1 (1 Tsinghua University; 2 Beijing Institute for General Artificial Intelligence, BIGAI)
Pseudocode | No | The paper describes the methodology and objectives using mathematical formulations and descriptive text, but it does not include any clearly labeled pseudocode or algorithm blocks with structured, step-by-step procedures.
Open Source Code | Yes | Project Page: https://andrewzh112.github.io/diver-ct
Open Datasets | Yes | "We then evaluate the resulting models on the Open LLM Leaderboard benchmarks (HellaSwag, ARC-Challenge, TruthfulQA, and Winogrande (Zellers et al. 2019; Clark et al. 2018; Lin, Hilton, and Evans 2022; Sakaguchi et al. 2021)) and red teaming benchmarks: AART, SAP, and AdvBench (Radharapu et al. 2023; Deng et al. 2023; Zou et al. 2023)"
Dataset Splits | No | The paper mentions combining datasets with a ratio ("We maintain a ratio of 2:1 for the alpaca and toxic refusal data") and using established benchmarks, but it does not provide specific train/test/validation splits (percentages or counts) for its own experimental setup of red teaming data generation or fine-tuning. It implicitly relies on standard splits for the cited benchmarks but does not specify its own data partitioning methodology.
Hardware Specification | No | The paper does not explicitly mention specific hardware components such as GPU models (e.g., NVIDIA A100), CPU models (e.g., Intel Xeon), or cloud computing instance types used for running the experiments.
Software Dependencies | No | The paper mentions various models and frameworks used (e.g., PPO, gpt-4-turbo, vicgalle/gpt2-alpaca-gpt4, Meta-Llama-Guard-2-8B), but it does not specify version numbers for these or for other ancillary software components such as programming languages or libraries.
Experiment Setup | Yes | "We adopted a similar experimental setup as CRT (Hong et al. 2024), with minimal changes to maintain consistency. We used identical hyperparameters used by CRT, enabling fair comparisons. ... We set τ = 0.5 for all experiments. ... We present the main results using three different thresholds: dsafe ∈ {0.5, 0.7, 0.9} in Section 5.1. ... We empirically found that βsafe = 0.4 matched the ASR of dsafe = 0.7 and βsafe = 0.3 matched dsafe = 0.5. ... We maintain a ratio of 2:1 for the alpaca and toxic refusal data."
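The hyperparameters quoted in the Experiment Setup row can be gathered into a small configuration sketch for anyone attempting a reproduction. Only the numeric values (τ = 0.5, the dsafe thresholds, the empirically matched βsafe values, and the 2:1 data ratio) come from the paper; the variable names and structure below are hypothetical.

```python
# Hypothetical reproduction config; names/structure are illustrative,
# values are those reported in the paper's experiment setup.
DIVER_CT_SETUP = {
    "tau": 0.5,                            # τ, fixed for all experiments
    "d_safe_thresholds": [0.5, 0.7, 0.9],  # thresholds used for main results
    # βsafe values the authors found to match the ASR of a given d_safe
    "beta_safe_matching": {0.7: 0.4, 0.5: 0.3},
    "alpaca_to_toxic_refusal_ratio": (2, 1),  # 2:1 data mixing ratio
}

def beta_for_dsafe(d_safe: float) -> float:
    """Look up the empirically ASR-matched beta_safe for a d_safe threshold."""
    return DIVER_CT_SETUP["beta_safe_matching"][d_safe]
```

Note that the paper reports no matched βsafe for dsafe = 0.9, so the lookup above covers only the two reported pairs.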