Instructional Segment Embedding: Improving LLM Safety with Instruction Hierarchy
Authors: Tong Wu, Shujian Zhang, Kaiqiang Song, Silei Xu, Sanqiang Zhao, Ravi Agrawal, Sathish Reddy Indurthi, Chong Xiang, Prateek Mittal, Wenxuan Zhou
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on the Structured Query and Instruction Hierarchy benchmarks demonstrate an average robust accuracy increase of up to 15.75% and 18.68%, respectively. |
| Researcher Affiliation | Collaboration | Tong Wu¹ Shujian Zhang² Kaiqiang Song² Silei Xu² Sanqiang Zhao² Ravi Agrawal² Sathish Indurthi² Chong Xiang¹ Prateek Mittal¹ Wenxuan Zhou² — ¹Princeton University, ²Zoom Video Communications |
| Pseudocode | Yes | Appendix A, "DETAILS OF IMPLEMENTING INSTRUCTIONAL SEGMENT EMBEDDING": "Here's an example of implementing Instructional Segment Embedding with a few lines of Python/PyTorch code. The additional code is highlighted in bold blue." |
| Open Source Code | Yes | We release our code at https://github.com/tongwu2020/ISE. |
| Open Datasets | Yes | Empirically, we conduct comprehensive experiments on two benchmarks: Structured Query (Chen et al., 2024) and Instruction Hierarchy (Wallace et al., 2024), which are constructed based on the Alpaca (Taori et al., 2023) and Ultrachat (Ding et al., 2023) datasets, respectively. |
| Dataset Splits | Yes | For the Adversarial Alpaca dataset, we incorporate instructions drawn from other samples (either directly or with a fabricated response) into the data and train the model to ignore such instructions. More details are available in Section B.1. For the Ultra Chat Baseline dataset, we use the Ultra Chat-200K dataset (Ding et al., 2023) and employ GPT-4o to decompose 10K prompts into three components: system instructions, user instructions, and data inputs. |
| Hardware Specification | No | The paper does not explicitly mention specific hardware details like GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | Appendix A provides a PyTorch code snippet, but it does not specify the version of PyTorch or any other software dependencies with their version numbers. |
| Experiment Setup | Yes | We employ supervised fine-tuning to update all model parameters for all baseline and ISE methods for three epochs. A learning rate of 2e-5 and a cosine learning-rate schedule are used. During inference, we use top-p sampling with the model's default settings. |
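The Pseudocode row above notes that the paper's Appendix A implements Instructional Segment Embedding in a few lines of PyTorch. A minimal sketch of the core idea, assuming the standard formulation (each token carries a segment ID for its role in the instruction hierarchy, and a learned per-segment embedding is added to the token embedding); the class and variable names here are illustrative, not the authors' code:

```python
import torch
import torch.nn as nn

class ISEEmbedding(nn.Module):
    """Token embedding augmented with a learned segment embedding.

    Segment IDs encode the instruction-hierarchy role of each token,
    e.g. 0 = system instruction, 1 = user instruction, 2 = data input.
    """

    def __init__(self, vocab_size: int, hidden_size: int, num_segments: int = 3):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, hidden_size)
        # The only addition over a vanilla embedding layer: one learned
        # vector per hierarchy role, summed into the token embeddings.
        self.segment_embedding = nn.Embedding(num_segments, hidden_size)

    def forward(self, input_ids: torch.Tensor, segment_ids: torch.Tensor) -> torch.Tensor:
        return self.token_embedding(input_ids) + self.segment_embedding(segment_ids)

# Usage: a 6-token sequence split across system / user / data segments.
emb = ISEEmbedding(vocab_size=100, hidden_size=16)
input_ids = torch.tensor([[1, 2, 3, 4, 5, 6]])
segment_ids = torch.tensor([[0, 0, 1, 1, 2, 2]])
out = emb(input_ids, segment_ids)
print(out.shape)  # torch.Size([1, 6, 16])
```

The released code at https://github.com/tongwu2020/ISE is the authoritative implementation; this sketch only illustrates where the extra embedding table enters the forward pass.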
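The Experiment Setup row quotes a learning rate of 2e-5 with a cosine schedule over three epochs. A hedged PyTorch sketch of that configuration (the stand-in model, step counts, and AdamW choice are assumptions for illustration; the paper does not specify the optimizer or steps per epoch):

```python
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the actual LLM being fine-tuned
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Assumed step budget: 3 epochs x 100 steps (steps per epoch is hypothetical).
num_epochs, steps_per_epoch = 3, 100
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=num_epochs * steps_per_epoch
)

for step in range(num_epochs * steps_per_epoch):
    # Real training would compute a loss and call loss.backward() here.
    optimizer.step()
    scheduler.step()

final_lr = optimizer.param_groups[0]["lr"]  # decays toward 0 by the end
```

The cosine schedule anneals the learning rate from 2e-5 toward zero over the full step budget, which matches the quoted setup up to the unspecified step counts.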