Safety Reasoning with Guidelines
Authors: Haoyu Wang, Zeyu Qin, Li Shen, Xueqian Wang, Dacheng Tao, Minhao Cheng
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that our method significantly improves model generalization against OOD attacks. Extensive experiments demonstrate that SRG significantly enhances safety generalization, enabling models to adaptively and robustly handle diverse and evolving OOD attacks. |
| Researcher Affiliation | Academia | 1Tsinghua University 2HKUST 3Nanyang Technological University 4Penn State University. Correspondence to: Zeyu Qin <EMAIL>, Xueqian Wang <EMAIL>. |
| Pseudocode | No | The paper describes methods and processes in paragraph text and flowcharts (Figure 3, Figure 4) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for its own methodology. It mentions using 'the source code provided by RepE' for analysis and downloading checkpoints/models for baselines, but no link or statement for the SRG method itself. |
| Open Datasets | Yes | We use illegal instructions from PKU-SafeRLHF (Ji et al., 2024) and helpful instructions from UltraFeedback (Cui et al., 2023), with corresponding responses re-generated by GPT-4o. We evaluate six attacks: 1) an ID attack, illegal instructions from Do-Not-Answer (Wang et al., 2023) and HarmBench (Mazeika et al., 2024), and 2) five OOD attacks: Jailbreaking Chat (Shen et al., 2024), SelfCipher (Yuan et al., 2023a), Past Tense (Andriushchenko & Flammarion, 2024), Persuasive Attack (Zeng et al., 2024) and PAIR (Chao et al., 2023). For helpfulness evaluation, we assess coding ability using HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021), math reasoning with GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021), and tool usage with BFCL (Yan et al., 2024). We also evaluate over-refusal performance using the XSTest dataset (Röttger et al., 2023). |
| Dataset Splits | Yes | We use two training dataset scales: 1) small-scale, consisting of 0.8K randomly selected illegal instructions and 2.5K helpful instructions; and 2) large-scale, containing 5K illegal instructions and 30K helpful instructions. We evaluate six attacks: (1) 200 illegal instructions from Do-Not-Answer (Wang et al., 2023) and HarmBench (Mazeika et al., 2024) (ID attack); (2) 200 Jailbreak Chat instructions from Do-Anything-Now (Shen et al., 2024) and DeRTa (Yuan et al., 2024a) (OOD attack); (3) 200 SelfCipher instructions from Yuan et al. (2023a) (OOD attack); (4) 100 Past Tense attack instructions from Andriushchenko & Flammarion (2024) (OOD attack); (5) 50 Persuasive Jailbreaker attack instructions from Zeng et al. (2024) (OOD attack); and (6) 50 black-box attacks from PAIR (Chao et al., 2023). |
| Hardware Specification | No | The paper mentions using 8B and 70B models and applying full-parameter finetuning or LoRA, but it does not specify any particular GPU, CPU, or other hardware used for running the experiments. |
| Software Dependencies | Yes | We use vLLM (Kwon et al., 2023) version 0.6.3 to run inference with our models. |
| Experiment Setup | Yes | The training configuration includes a cutoff length of 4096, a batch size of 64, 3 training epochs, a cosine learning rate scheduler, and a warmup ratio of 0.1. For SFT with LoRA, we set the learning rate to 1e-4. For full finetuning, we set the learning rate to 1e-5. |
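The hyperparameters quoted in the Experiment Setup row can be collected into a single configuration sketch. The field names below are illustrative assumptions (the paper does not name its training framework or its config keys); only the values are taken from the reported setup.

```python
# Hypothetical training configuration mirroring the reported setup.
# Field names are illustrative; only the values come from the paper.
TRAIN_CONFIG = {
    "cutoff_len": 4096,        # maximum sequence length
    "batch_size": 64,
    "num_epochs": 3,
    "lr_scheduler": "cosine",
    "warmup_ratio": 0.1,
}

# The learning rate depends on the finetuning mode used.
LEARNING_RATES = {
    "lora": 1e-4,   # SFT with LoRA
    "full": 1e-5,   # full-parameter finetuning
}

def learning_rate(mode: str) -> float:
    """Return the reported learning rate for a finetuning mode ('lora' or 'full')."""
    return LEARNING_RATES[mode]
```

This is a sketch for readers reconstructing the setup, not the authors' actual training script; the 10x gap between the LoRA and full-finetuning learning rates matches the common practice of using larger rates for adapter-only updates.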