Improving Instruction-Following in Language Models through Activation Steering

Authors: Alessandro Stolfo, Vidhisha Balachandran, Safoora Yousefi, Eric Horvitz, Besmira Nushi

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments across four models demonstrate how we can use the activation vectors to guide models to follow constraints even without explicit instructions and to enhance performance when instructions are present. We conduct experiments using the Phi-3 (Abdin et al., 2024), Gemma 2 2B and 9B (Gemma Team, 2024), and Mistral 7B (Jiang et al., 2023) models, focusing on three types of instructions: output format (§3), output length (§4), and the inclusion/exclusion of specific words (§5). Our results on the IFEval dataset (Zhou et al., 2023a) provide evidence that vector representations can encode a wide range of instructions and enhance the model's instruction-following performance.
Researcher Affiliation | Collaboration | Alessandro Stolfo (1), Vidhisha Balachandran (2), Safoora Yousefi (2), Eric Horvitz (2), Besmira Nushi (2); (1) ETH Zürich, (2) Microsoft Research
Pseudocode | No | The paper describes the steering procedure in Section 2.2 with mathematical formulas and prose, but no explicit pseudocode block or algorithm steps are presented in a structured format.
Open Source Code | Yes | Our code and data are available at https://github.com/microsoft/llm-steer-instruct.
Open Datasets | Yes | We use an augmented version of the IFEval dataset (Zhou et al., 2023a), which consists of 25 distinct instructions, each paired with multiple base queries and expressed in different phrasings, for a total of 541 prompts.
Dataset Splits | Yes | For steering vector computation and layer selection, we construct a separate set of synthetically generated prompts by combining base queries from IFEval with corresponding instruction descriptions to avoid test information leakage. Additional details about the data used are provided in Apps. C and D. For validation, we use a set of 96 examples (8 per instruction), sampled from the synthetic data described above. For length instructions, the steering vectors are computed using a fixed set of 50 IFEval base queries. The evaluation data used in §4 and Appendix J consists of a separate set of 200 base queries. For validation, we use GPT-4o to synthetically generate a set of questions similar to the base queries in IFEval. Additionally, the prompt requests the generation of a list of words likely to appear in the answer to each question. These question-word pairs (276 in total) are used for grid search validation of word inclusion and exclusion. For evaluation, we use IFEval examples containing keyword inclusion and exclusion instructions... This process yields 86 evaluation prompts for keyword inclusion and 117 for keyword exclusion.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, memory amounts, or detailed computer specifications) used for running the experiments. It mentions various language models used but not the underlying computational resources.
Software Dependencies | No | Our experiments were carried out using PyTorch (Paszke et al., 2019) and the TransformerLens library (Nanda & Bloom, 2022). We performed our data analysis using NumPy (Harris et al., 2020) and Pandas (Wes McKinney, 2010). Our figures were made using Plotly (Plotly Technologies Inc., 2015). The paper lists several software libraries with citations but does not provide specific version numbers for these libraries (e.g., 'PyTorch 1.x' rather than just 'PyTorch (Paszke et al., 2019)').
Experiment Setup | Yes | For format instructions, we use a systematic scaling approach where the value of c is selected to ensure that the residual stream activations are mapped to their mean value on inputs that contain the instruction in question. For length instructions, which have a more continuous nature, we experiment and show results with different values of c, illustrating their impact on the model's output. Finally, for word-specific constraints, we compute the weight using Eq. (2) and additionally perform a small grid search over neighboring values on a held-out set of examples to fine-tune the steering effect. The steering vector c·u_l is then added to the corresponding residual stream layer and the forward pass is resumed with the updated residual stream value x'_l = x_l + c·u_l. Model outputs are decoded greedily, with a maximum generation length of 2048 tokens for format and length experiments, and 1024 tokens for keyword experiments. For efficiency, validation runs use a reduced maximum length of 384 tokens.
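The steering step quoted above (resume the forward pass with x'_l = x_l + c·u_l) can be sketched with a plain PyTorch forward hook. This is not the authors' implementation (they use TransformerLens); the TinyBlock module, hidden size, and scaling factor below are illustrative assumptions standing in for a real transformer layer and a validated value of c.

```python
import torch

class TinyBlock(torch.nn.Module):
    """Stand-in for one transformer layer's residual-stream update (assumption)."""
    def __init__(self, d_model):
        super().__init__()
        self.linear = torch.nn.Linear(d_model, d_model)

    def forward(self, x):
        return x + self.linear(x)  # residual connection

def add_steering_hook(module, u, c):
    """Register a forward hook that shifts the module's output by c * u,
    i.e. the update x'_l = x_l + c * u_l applied at one chosen layer."""
    def hook(_module, _inputs, output):
        return output + c * u
    return module.register_forward_hook(hook)

d_model = 8
torch.manual_seed(0)
block = TinyBlock(d_model)

u = torch.randn(d_model)
u = u / u.norm()   # unit-norm steering direction (the paper derives u_l from activations)
c = 4.0            # scaling factor; the paper selects c via mean-matching or grid search

x = torch.randn(2, d_model)
baseline = block(x)
handle = add_steering_hook(block, u, c)
steered = block(x)
handle.remove()

# The steered residual stream differs from the baseline by exactly c * u.
assert torch.allclose(steered - baseline, c * u.expand_as(baseline), atol=1e-6)
```

In a real model the hook would be attached to the residual stream at the layer selected on the validation set, and generation would then proceed greedily as described above.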