Do LLMs "know" internally when they follow instructions?

Authors: Juyeon Heo, Christina Heinze-Deml, Oussama Elachqar, Kwan Ho Ryan Chan, Shirley Ren, Andrew Miller, Udhyakumar Nallasamy, Jaya Narain

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our analysis identifies a direction in the input embedding space, termed the instruction-following dimension, that predicts whether a response will comply with a given instruction. We find that this dimension generalizes well across unseen tasks but not across unseen instruction types. We demonstrate that modifying representations along this dimension improves instruction-following success rates compared to random changes, without compromising response quality.
Researcher Affiliation Collaboration Juyeon Heo (1, *), Christina Heinze-Deml (2), Oussama Elachqar (2), Kwan Ho Ryan Chan (3, *), Shirley Ren (2), Udhay Nallasamy (2), Andy Miller (2), Jaya Narain (2); 1 University of Cambridge, 2 Apple, 3 University of Pennsylvania
Pseudocode No The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code Yes Code and data are available at https://github.com/apple/ml-internal-llms-instruction-following
Open Datasets Yes To objectively evaluate LLMs with simple and verifiable instructions, we select IFEval (Zhou et al., 2023) as our base dataset. The IFEval-simple data is available at https://github.com/apple/ml-internal-llms-instruction-following.
Dataset Splits Yes To evaluate task generalization, we split the data along the task dimension, using a 70-30 train-test split across the 100 tasks. To evaluate instruction-type generalization, we applied a leave-one-out approach over the instruction-type dimension.
Hardware Specification No The paper mentions analyzing models like LLaMA-2-7B-chat, LLaMA-2-13B-chat, Mistral-7B-Instruct-v0.3, and Phi-3-mini-128k-instruct but does not specify the hardware (e.g., specific GPU or CPU models) used to run the experiments.
Software Dependencies No The paper mentions 'AdamW' as an optimizer and 'GPT-4' for quality assessment, but it does not specify any software libraries or frameworks with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup Yes A simple linear model was trained to predict the instruction-following success outcome, optimized for 1000 epochs with AdamW, a 0.001 learning rate, and 0.1 weight decay. The selected α values were: 0.3 for Llama-2-chat-13b and Llama-2-chat-7b, 0.1 for Phi-3, and 0.15 for Mistral-7B.
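The probe setup above can be sketched as follows. The stated hyperparameters (1000 epochs, AdamW, lr 0.001, weight decay 0.1) are taken from the report; everything else is an assumption: the activations and success labels are synthetic placeholders, the dimensions are shrunk for illustration, and the logistic-loss objective is a guess at the unspecified training target.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n = 64, 256                    # toy sizes (real hidden dims are ~4096)
X = rng.standard_normal((n, d_model))   # placeholder hidden-state features
true_dir = rng.standard_normal(d_model)
y = (X @ true_dir > 0).astype(float)    # placeholder success labels

# Linear probe trained with AdamW-style decoupled weight decay,
# matching the reported hyperparameters: 1000 epochs, lr=1e-3, wd=0.1.
w = np.zeros(d_model)
m, v = np.zeros(d_model), np.zeros(d_model)
b1, b2, eps, lr, wd = 0.9, 0.999, 1e-8, 1e-3, 0.1
for t in range(1, 1001):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))   # sigmoid predictions
    grad = X.T @ (p - y) / n             # logistic-loss gradient
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    mhat, vhat = m / (1 - b1**t), v / (1 - b2**t)
    w -= lr * (mhat / (np.sqrt(vhat) + eps) + wd * w)  # decoupled decay step

# The probe's weight vector is the candidate "instruction-following dimension".
direction = w / np.linalg.norm(w)
```

The normalized weight vector is what the report's generalization experiments would then evaluate on held-out tasks and instruction types.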
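The representation modification described above can be sketched with the reported per-model α values. The exact intervention form is not specified in this summary, so the `steer` helper below assumes a simple additive shift along the unit-normalized direction; `h` and `d` are hypothetical placeholders.

```python
import numpy as np

# Per-model steering strengths as reported in the setup.
ALPHA = {
    "Llama-2-chat-7b": 0.3,
    "Llama-2-chat-13b": 0.3,
    "Phi-3": 0.1,
    "Mistral-7B": 0.15,
}

def steer(h: np.ndarray, d: np.ndarray, alpha: float) -> np.ndarray:
    """Shift representation h by alpha along the unit-normalized direction d
    (an assumed additive intervention, not the paper's exact procedure)."""
    d_unit = d / np.linalg.norm(d)
    return h + alpha * d_unit

rng = np.random.default_rng(0)
h = rng.standard_normal(64)   # placeholder hidden representation
d = rng.standard_normal(64)   # placeholder instruction-following direction
h_steered = steer(h, d, ALPHA["Mistral-7B"])
```

Under this additive form, the edit moves the representation by exactly α in the direction of d, which is the comparison baseline the report contrasts with random changes of the same magnitude.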