Safety Reasoning with Guidelines

Authors: Haoyu Wang, Zeyu Qin, Li Shen, Xueqian Wang, Dacheng Tao, Minhao Cheng

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that our method significantly improves model generalization against OOD attacks. Extensive experiments demonstrate that SRG significantly enhances safety generalization, enabling models to adaptively and robustly handle diverse and evolving OOD attacks.
Researcher Affiliation | Academia | 1 Tsinghua University, 2 HKUST, 3 Nanyang Technological University, 4 Penn State University. Correspondence to: Zeyu Qin <EMAIL>, Xueqian Wang <EMAIL>.
Pseudocode | No | The paper describes methods and processes in paragraph text and flowcharts (Figure 3, Figure 4) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide concrete access to source code for its own methodology. It mentions using 'the source code provided by RepE' for analysis and downloading checkpoints/models for baselines, but gives no link or code-availability statement for the SRG method itself.
Open Datasets | Yes | We use illegal instructions from PKU-SafeRLHF (Ji et al., 2024) and helpful instructions from UltraFeedback (Cui et al., 2023), with corresponding responses re-generated by GPT-4o. We evaluate six attacks: 1) an ID attack, illegal instructions from Do-Not-Answer (Wang et al., 2023) and HarmBench (Mazeika et al., 2024), and 2) five OOD attacks: Jailbreak Chat (Shen et al., 2024), SelfCipher (Yuan et al., 2023a), Past Tense (Andriushchenko & Flammarion, 2024), Persuasive Attack (Zeng et al., 2024), and PAIR (Chao et al., 2023). For helpfulness evaluation, we assess coding ability using HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021), math reasoning with GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021), and tool usage with BFCL (Yan et al., 2024). We also evaluate over-refusal performance using the XSTest dataset (Röttger et al., 2023).
Dataset Splits | Yes | We use two training dataset scales: 1) small-scale, consisting of 0.8K randomly selected illegal instructions and 2.5K helpful instructions; and 2) large-scale, containing 5K illegal instructions and 30K helpful instructions. We evaluate six attacks: (1) 200 illegal instructions from Do-Not-Answer (Wang et al., 2023) and HarmBench (Mazeika et al., 2024) (ID attack); (2) 200 Jailbreak Chat instructions from Do-Anything-Now (Shen et al., 2024) and DeRTa (Yuan et al., 2024a) (OOD attack); (3) 200 SelfCipher instructions from Yuan et al. (2023a) (OOD attack); (4) 100 Past Tense attack instructions from Andriushchenko & Flammarion (2024) (OOD attack); (5) 50 Persuasive Jailbreaker attack instructions from Zeng et al. (2024) (OOD attack); and (6) 50 black-box attacks from PAIR (Chao et al., 2023).
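The split sizes quoted above can be sanity-checked with a small registry. This is an illustrative sketch, not code from the paper; the dictionary layout and names (`EVAL_ATTACKS`, `TRAIN_SCALES`) are assumptions, while the counts and ID/OOD labels come from the quoted text.

```python
# Illustrative registry of the evaluation suite quoted above.
# Counts and ID/OOD labels come from the paper's text; the structure is assumed.
EVAL_ATTACKS = {
    "Do-Not-Answer + HarmBench": {"n": 200, "dist": "ID"},
    "Jailbreak Chat":            {"n": 200, "dist": "OOD"},
    "SelfCipher":                {"n": 200, "dist": "OOD"},
    "Past Tense":                {"n": 100, "dist": "OOD"},
    "Persuasive Jailbreaker":    {"n": 50,  "dist": "OOD"},
    "PAIR":                      {"n": 50,  "dist": "OOD"},
}

# Two training scales: small (0.8K illegal / 2.5K helpful) and large (5K / 30K).
TRAIN_SCALES = {
    "small": {"illegal": 800,   "helpful": 2_500},
    "large": {"illegal": 5_000, "helpful": 30_000},
}

def total_eval_instructions() -> int:
    """Total number of evaluation instructions across all six attacks."""
    return sum(a["n"] for a in EVAL_ATTACKS.values())

def ood_attacks() -> list[str]:
    """Names of the out-of-distribution attacks (five of the six)."""
    return [name for name, a in EVAL_ATTACKS.items() if a["dist"] == "OOD"]
```

A quick check confirms the tallies implied by the text: six attacks totaling 800 evaluation instructions, five of which are OOD.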
Hardware Specification | No | The paper mentions using 8B and 70B models with full-parameter finetuning or LoRA, but it does not specify the GPUs, CPUs, or other hardware used to run the experiments.
Software Dependencies | Yes | We use vLLM (Kwon et al., 2023), version 0.6.3, to run inference with our models.
Experiment Setup | Yes | The training configuration includes a cutoff length of 4096, a batch size of 64, 3 training epochs, a cosine learning-rate scheduler, and a warmup ratio of 0.1. For SFT with LoRA, the learning rate is 1e-4; for full finetuning, it is 1e-5.
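The reported training configuration can be collected into a single config object, which makes the LoRA-vs-full-finetuning learning-rate distinction explicit. This is a minimal sketch under stated assumptions: the `SFTConfig` class and field names are illustrative, not the paper's code; the hyperparameter values are the ones quoted above.

```python
from dataclasses import dataclass

@dataclass
class SFTConfig:
    """Hypothetical container for the training setup reported in the paper."""
    cutoff_len: int = 4096        # maximum sequence length
    batch_size: int = 64
    num_epochs: int = 3
    lr_scheduler: str = "cosine"
    warmup_ratio: float = 0.1
    use_lora: bool = True         # False => full-parameter finetuning

    @property
    def learning_rate(self) -> float:
        # The paper reports 1e-4 for SFT with LoRA and 1e-5 for full finetuning.
        return 1e-4 if self.use_lora else 1e-5
```

Deriving the learning rate from the finetuning mode, rather than storing it separately, prevents the two settings from drifting apart when switching between LoRA and full-parameter runs.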