Safety Reasoning with Guidelines

Authors: Haoyu Wang, Zeyu Qin, Li Shen, Xueqian Wang, Dacheng Tao, Minhao Cheng

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that our method significantly improves model generalization against OOD attacks. Extensive experiments demonstrate that SRG significantly enhances safety generalization, enabling models to adaptively and robustly handle diverse and evolving OOD attacks.
Researcher Affiliation | Academia | 1 Tsinghua University, 2 HKUST, 3 Nanyang Technological University, 4 Penn State University. Correspondence to: Zeyu Qin <EMAIL>, Xueqian Wang <EMAIL>.
Pseudocode | No | The paper describes methods and processes in paragraph text and flowcharts (Figure 3, Figure 4) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide concrete access to source code for its own methodology. It mentions using 'the source code provided by RepE' for analysis and downloading checkpoints/models for baselines, but gives no link or code-availability statement for the SRG method itself.
Open Datasets | Yes | We use illegal instructions from PKU-SafeRLHF (Ji et al., 2024) and helpful instructions from UltraFeedback (Cui et al., 2023), with corresponding responses re-generated by GPT-4o. We evaluate six attacks: 1) an ID attack, illegal instructions from Do-Not-Answer (Wang et al., 2023) and HarmBench (Mazeika et al., 2024), and 2) five OOD attacks: Jailbreak Chat (Shen et al., 2024), SelfCipher (Yuan et al., 2023a), Past Tense (Andriushchenko & Flammarion, 2024), Persuasive Attack (Zeng et al., 2024), and PAIR (Chao et al., 2023). For helpfulness evaluation, we assess coding ability using HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021), math reasoning with GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021), and tool usage with BFCL (Yan et al., 2024). We also evaluate over-refusal performance using the XSTest dataset (Röttger et al., 2023).
Dataset Splits | Yes | We use two training dataset scales: 1) small-scale, consisting of 0.8K randomly selected illegal instructions and 2.5K helpful instructions; and 2) large-scale, containing 5K illegal instructions and 30K helpful instructions. We evaluate six attacks: (1) 200 illegal instructions from Do-Not-Answer (Wang et al., 2023) and HarmBench (Mazeika et al., 2024) (ID attack); (2) 200 Jailbreak Chat instructions from Do-Anything-Now (Shen et al., 2024) and DeRTa (Yuan et al., 2024a) (OOD attack); (3) 200 SelfCipher instructions from Yuan et al. (2023a) (OOD attack); (4) 100 Past Tense attack instructions from Andriushchenko & Flammarion (2024) (OOD attack); (5) 50 Persuasive Jailbreaker attack instructions from Zeng et al. (2024) (OOD attack); and (6) 50 black-box attacks from PAIR (Chao et al., 2023).
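The split sizes quoted above can be sanity-checked with a small registry. This is an illustrative sketch, not code from the paper; the dictionary layout and names (`EVAL_ATTACKS`, `TRAIN_SCALES`) are assumptions, while the counts and ID/OOD labels come from the quoted text.

```python
# Illustrative registry of the evaluation suite quoted above.
# Counts and ID/OOD labels come from the paper's text; the structure is assumed.
EVAL_ATTACKS = {
    "Do-Not-Answer + HarmBench": {"n": 200, "dist": "ID"},
    "Jailbreak Chat":            {"n": 200, "dist": "OOD"},
    "SelfCipher":                {"n": 200, "dist": "OOD"},
    "Past Tense":                {"n": 100, "dist": "OOD"},
    "Persuasive Jailbreaker":    {"n": 50,  "dist": "OOD"},
    "PAIR":                      {"n": 50,  "dist": "OOD"},
}

# Two training scales: small (0.8K illegal / 2.5K helpful) and large (5K / 30K).
TRAIN_SCALES = {
    "small": {"illegal": 800,   "helpful": 2_500},
    "large": {"illegal": 5_000, "helpful": 30_000},
}

def total_eval_instructions() -> int:
    """Total number of evaluation instructions across all six attacks."""
    return sum(a["n"] for a in EVAL_ATTACKS.values())

def ood_attacks() -> list[str]:
    """Names of the out-of-distribution attacks (five of the six)."""
    return [name for name, a in EVAL_ATTACKS.items() if a["dist"] == "OOD"]
```

A quick check confirms the tallies implied by the text: six attacks totaling 800 evaluation instructions, five of which are OOD.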
Hardware Specification | No | The paper mentions using 8B and 70B models with full-parameter finetuning or LoRA, but it does not specify the GPUs, CPUs, or other hardware used to run the experiments.
Software Dependencies | Yes | We use vLLM (Kwon et al., 2023), version 0.6.3, to run inference with our models.
Experiment Setup | Yes | The training configuration includes a cutoff length of 4096, a batch size of 64, 3 training epochs, a cosine learning-rate scheduler, and a warmup ratio of 0.1. For SFT with LoRA, the learning rate is 1e-4; for full finetuning, it is 1e-5.
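The reported training configuration can be collected into a single config object, which makes the LoRA-vs-full-finetuning learning-rate distinction explicit. This is a minimal sketch under stated assumptions: the `SFTConfig` class and field names are illustrative, not the paper's code; the hyperparameter values are the ones quoted above.

```python
from dataclasses import dataclass

@dataclass
class SFTConfig:
    """Hypothetical container for the training setup reported in the paper."""
    cutoff_len: int = 4096        # maximum sequence length
    batch_size: int = 64
    num_epochs: int = 3
    lr_scheduler: str = "cosine"
    warmup_ratio: float = 0.1
    use_lora: bool = True         # False => full-parameter finetuning

    @property
    def learning_rate(self) -> float:
        # The paper reports 1e-4 for SFT with LoRA and 1e-5 for full finetuning.
        return 1e-4 if self.use_lora else 1e-5
```

Deriving the learning rate from the finetuning mode, rather than storing it separately, prevents the two settings from drifting apart when switching between LoRA and full-parameter runs.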