Instructional Segment Embedding: Improving LLM Safety with Instruction Hierarchy

Authors: Tong Wu, Shujian Zhang, Kaiqiang Song, Silei Xu, Sanqiang Zhao, Ravi Agrawal, Sathish Reddy Indurthi, Chong Xiang, Prateek Mittal, Wenxuan Zhou

ICLR 2025

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Our experiments on the Structured Query and Instruction Hierarchy benchmarks demonstrate an average robust accuracy increase of up to 15.75% and 18.68%, respectively." |
| Researcher Affiliation | Collaboration | Tong Wu¹, Shujian Zhang², Kaiqiang Song², Silei Xu², Sanqiang Zhao², Ravi Agrawal², Sathish Indurthi², Chong Xiang¹, Prateek Mittal¹, Wenxuan Zhou²; ¹Princeton University, ²Zoom Video Communications |
| Pseudocode | Yes | "A DETAILS OF IMPLEMENTING INSTRUCTIONAL SEGMENT EMBEDDING: Here's an example of implementing Instructional Segment Embedding with a few lines of Python/PyTorch code. The additional code is highlighted in bold blue." |
| Open Source Code | Yes | "We release our code at https://github.com/tongwu2020/ISE." |
| Open Datasets | Yes | "Empirically, we conduct comprehensive experiments on two benchmarks: Structured Query (Chen et al., 2024) and Instruction Hierarchy (Wallace et al., 2024), which are constructed based on the Alpaca (Taori et al., 2023) and UltraChat (Ding et al., 2023) datasets, respectively." |
| Dataset Splits | Yes | "For the Adversarial Alpaca dataset, we incorporate instructions drawn from other samples (either directly or with a fabricated response) into the data and train the model to ignore such instructions. More details are available in Section B.1. For the UltraChat Baseline dataset, we use the UltraChat-200K dataset (Ding et al., 2023) and employ GPT-4o to decompose 10K prompts into three components: system instructions, user instructions, and data inputs." |
| Hardware Specification | No | The paper does not specify the hardware used for the experiments (GPU models, CPU types, or memory). |
| Software Dependencies | No | Appendix A provides a PyTorch code snippet, but the paper does not pin PyTorch or any other software dependency to a version number. |
| Experiment Setup | Yes | "We employ supervised fine-tuning to update all model parameters for all baseline and ISE methods with three epochs. A learning rate of 2e-5 and a cosine learning schedule are used. During inference, we use top-p sampling methods with the model's default settings." |
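The Pseudocode row refers to the paper's Appendix A snippet, which is not reproduced here. As a rough, hypothetical sketch of the core idea (the class, segment labels, and sizes below are my own, not the paper's): a second learned embedding table, indexed by each token's role in the instruction hierarchy, is added elementwise to the token embedding before the transformer layers.

```python
import torch
import torch.nn as nn

# Assumed segment labels, one per role in the instruction hierarchy
# (the paper distinguishes system instructions, user instructions, and data).
SYSTEM, USER, DATA = 0, 1, 2

class ISEEmbedding(nn.Module):
    """Token embedding augmented with a learned instructional segment
    embedding, summed elementwise before the transformer layers."""

    def __init__(self, vocab_size: int, hidden_size: int, num_segments: int = 3):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, hidden_size)
        # Extra lookup table: one vector per segment type.
        self.segment_embed = nn.Embedding(num_segments, hidden_size)

    def forward(self, input_ids: torch.Tensor, segment_ids: torch.Tensor) -> torch.Tensor:
        # Both inputs have shape (batch, seq_len); the embeddings are added.
        return self.token_embed(input_ids) + self.segment_embed(segment_ids)

# Example: a prompt whose tokens are system instructions, then user
# instructions, then data input.
embed = ISEEmbedding(vocab_size=32000, hidden_size=64)
input_ids = torch.randint(0, 32000, (1, 6))
segment_ids = torch.tensor([[SYSTEM, SYSTEM, USER, USER, DATA, DATA]])
out = embed(input_ids, segment_ids)
print(out.shape)  # torch.Size([1, 6, 64])
```

Because the segment embedding is part of the input representation, every transformer layer can condition on each token's privilege level, rather than inferring it from delimiter tokens alone.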
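The Experiment Setup row maps onto a standard PyTorch fine-tuning configuration; a minimal sketch under stated assumptions (the stand-in model, optimizer choice, and step counts are placeholders, only the learning rate of 2e-5, the cosine schedule, and the three epochs come from the paper):

```python
import torch

model = torch.nn.Linear(16, 16)  # stand-in; the paper fine-tunes all LLM parameters
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # lr from the paper
epochs, steps_per_epoch = 3, 100  # three epochs per the paper; step count is a placeholder
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=epochs * steps_per_epoch)  # cosine learning-rate schedule

for _ in range(epochs * steps_per_epoch):
    optimizer.zero_grad()
    loss = model(torch.randn(4, 16)).pow(2).mean()  # dummy stand-in for the SFT loss
    loss.backward()
    optimizer.step()
    scheduler.step()

# The learning rate has decayed along the cosine curve toward zero.
print(scheduler.get_last_lr()[0])
```

With `T_max` set to the total number of training steps, the learning rate follows a single cosine decay from 2e-5 toward zero over the three epochs.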