STAIR: Improving Safety Alignment with Introspective Reasoning

Authors: Yichi Zhang, Siyuan Zhang, Yao Huang, Zeyu Xia, Zhengwei Fang, Xiao Yang, Ranjie Duan, Dong Yan, Yinpeng Dong, Jun Zhu

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments show that STAIR effectively mitigates harmful outputs while better preserving helpfulness, compared to instinctive alignment strategies. We conduct extensive experiments to assess the effectiveness of STAIR. In terms of safety, STAIR consistently enhances the resistance to various harmful queries, achieving a goodness score of 0.88 on StrongREJECT for LLaMA, outperforming the best baseline by 0.15.
Researcher Affiliation Collaboration 1Department of Computer Science and Technology, College of AI, Institute for AI, Tsinghua-Bosch Joint ML Center, THBI Lab, BNRist Center, Tsinghua University, Beijing, 100084, China; 2RealAI; 3Institute of Artificial Intelligence, Beihang University, Beijing, 100191, China; 4Alibaba Group; 5Baichuan AI. Correspondence to: Yinpeng Dong <EMAIL>, Jun Zhu <EMAIL>.
Pseudocode Yes The whole procedure of Safety-Informed MCTS follows Algorithm 1. In practice, we set the exploration parameter c = 1.5, search budget n = 200, and children number m = 4. To generate child nodes and roll out to final answers, we set temperature to 1.2, top-p to 0.9, and top-k to 50. We adjust these parameters when higher diversity is needed.
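The quoted hyperparameters (c = 1.5, n = 200, m = 4) can be illustrated with a toy UCT search loop. This is a minimal sketch of generic Monte Carlo tree search under those settings, not the paper's Safety-Informed MCTS: the `toy_reward` function and integer successor states are hypothetical stand-ins for the LLM rollouts and safety-informed reward used in Algorithm 1.

```python
import math

C_EXPLORE = 1.5   # exploration parameter c from the paper
BUDGET = 200      # search budget n
NUM_CHILDREN = 4  # children per expansion m

class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value_sum = 0.0

    def uct(self, child):
        # Standard UCT score: mean value plus exploration bonus.
        exploit = child.value_sum / child.visits
        explore = C_EXPLORE * math.sqrt(math.log(self.visits) / child.visits)
        return exploit + explore

def toy_reward(state):
    # Stand-in for the paper's safety-informed reward: a deterministic
    # function of the toy integer state, peaked at state == 7.
    return 1.0 / (1.0 + abs(state - 7))

def search(root_state=0):
    root = Node(root_state)
    for _ in range(BUDGET):
        node = root
        # Selection: descend by UCT while the node is fully expanded.
        while node.children and len(node.children) == NUM_CHILDREN:
            node = max(node.children, key=node.uct)
        # Expansion: add one child (toy successor states).
        if len(node.children) < NUM_CHILDREN:
            child = Node(node.state * 2 + len(node.children), parent=node)
            node.children.append(child)
            node = child
        # Rollout reward + backpropagation up to the root.
        reward = toy_reward(node.state)
        while node is not None:
            node.visits += 1
            node.value_sum += reward
            node = node.parent
    # Return the most-visited root action's state.
    return max(root.children, key=lambda c: c.visits).state
```

In the paper's setting, expansion and rollout would instead sample reasoning steps from the LLM at the quoted temperature/top-p/top-k values.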
Open Source Code Yes We have open-sourced our code, datasets and models at https://github.com/thu-ml/STAIR.
Open Datasets Yes We have open-sourced our code, datasets and models at https://github.com/thu-ml/STAIR. We take a dataset D comprising 50k samples from three sources. For safety-focused data, we use a modified version of 22k preference samples from PKU-SafeRLHF (Ji et al., 2024b) along with 3k jailbreak data from JailbreakV-28K (Luo et al., 2024b). Additionally, 25k pairwise data are drawn from UltraFeedback (Cui et al., 2024) to maintain helpfulness, as done in prior works (Qi et al., 2025; Wu et al., 2024).
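The quoted data mixture (22k safety preferences + 3k jailbreak samples + 25k helpfulness samples = 50k total) can be restated as a simple sanity check. The dict below only encodes the counts from the quote; it is not a loader for the actual datasets.

```python
# Training-data mixture as described in the quoted excerpt.
mixture = {
    "PKU-SafeRLHF (safety preferences)": 22_000,
    "JailbreakV-28K (jailbreak prompts)": 3_000,
    "UltraFeedback (helpfulness pairs)": 25_000,
}

total = sum(mixture.values())
safety_fraction = (mixture["PKU-SafeRLHF (safety preferences)"]
                   + mixture["JailbreakV-28K (jailbreak prompts)"]) / total
```

The split is thus exactly half safety-focused (25k) and half helpfulness-focused (25k).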
Dataset Splits Yes We use 10 popular benchmarks to evaluate harmlessness and general performance of the trained models. For StrongREJECT (Souly et al., 2024), we take the official evaluation protocol, which uses GPT-4o to evaluate the responses and gives a rubric-based score reflecting the willingness and capabilities in responding to harmful queries. For XSTest (Röttger et al., 2024), we select the unsafe split to evaluate the resistance to normal harmful queries and follow its official implementation on refusal determination with GPT-4. All benchmarks are evaluated following official implementations.
Hardware Specification Yes In this work, we conduct all our experiments on clusters with 8 NVIDIA A800 GPUs.
Software Dependencies No The paper mentions tools like LLaMA-Factory (Zheng et al., 2024) and OpenRLHF (Hu et al., 2024) but does not provide specific version numbers for these software components.
Experiment Setup Yes For all methods in training LLMs, optimization with SFT is for 3 epochs and that with DPO is for 1 epoch by default. We tune the learning rate from {5e-7, 1e-6, 5e-6} and β for DPO from {0.1, 0.2, 0.4}. Batch size is fixed as 128 and weight decay is set to 0. We adopt a cosine scheduler with a warm-up ratio of 0.1. Following the official implementation, we set β = 0.1 and β/λ = 0.025 for SACPO. For Self-Rewarding and our self-improving framework, we take K = 3 iterations. We take an auxiliary SFT loss with a coefficient of 0.2 in our self-improvement to preserve the structured CoT style.
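The cosine scheduler with a warm-up ratio of 0.1 mentioned above can be sketched as a simple function of the training step. The peak learning rate (1e-6, one of the quoted candidates) and the 100-step horizon are illustrative assumptions, not the paper's actual training lengths.

```python
import math

def lr_at(step, total_steps, peak_lr=1e-6, warmup_ratio=0.1):
    """Linear warmup for the first warmup_ratio of steps, then cosine decay."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from ~0 up to peak_lr over the warmup phase.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr toward 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

schedule = [lr_at(s, 100) for s in range(100)]
```

The schedule reaches its peak at the end of warmup (step 10 of 100 here) and decays smoothly to near zero, matching the standard cosine-with-warmup behavior implied by the setup description.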