Do LLMs "know" internally when they follow instructions?
Authors: Juyeon Heo, Christina Heinze-Deml, Oussama Elachqar, Kwan Ho Ryan Chan, Shirley Ren, Andrew Miller, Udhyakumar Nallasamy, Jaya Narain
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our analysis identifies a direction in the input embedding space, termed the instruction-following dimension, that predicts whether a response will comply with a given instruction. We find that this dimension generalizes well across unseen tasks but not across unseen instruction types. We demonstrate that modifying representations along this dimension improves instruction-following success rates compared to random changes, without compromising response quality. |
| Researcher Affiliation | Collaboration | Juyeon Heo1,* Christina Heinze-Deml2 Oussama Elachqar2 Kwan Ho Ryan Chan3,* Shirley Ren2 Udhay Nallasamy2 Andy Miller2 Jaya Narain2 1University of Cambridge 2Apple 3University of Pennsylvania |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and data are available at https://github.com/apple/ml-internal-llms-instruction-following |
| Open Datasets | Yes | To objectively evaluate LLMs with simple and verifiable instructions, we select IFEval (Zhou et al., 2023) as our base dataset. The IFEval-simple data is available at https://github.com/apple/ml-internal-llms-instruction-following. |
| Dataset Splits | Yes | To evaluate task generalization, we split the data by the task dimension, using a 70-30 train-test split across the 100 tasks. To evaluate instruction-type generalization, we applied a leave-one-out approach, over the instruction-type dimension. |
| Hardware Specification | No | The paper mentions analyzing models like LLaMA-2-7B-chat, LLaMA-2-13B-chat, Mistral-7B-Instruct-v0.3, and Phi-3-mini-128k-instruct but does not specify the hardware (e.g., specific GPU or CPU models) used to run the experiments. |
| Software Dependencies | No | The paper mentions 'AdamW' as an optimizer and 'GPT-4' for quality assessment, but it does not specify any software libraries or frameworks with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | A simple linear model was trained to predict instruction-following success, optimized for 1000 epochs with AdamW, a 0.001 learning rate, and 0.1 weight decay. The selected values of the intervention strength α were: 0.3 for Llama-2-chat-13b and Llama-2-chat-7b, 0.1 for Phi-3, and 0.15 for Mistral-7B. |
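
The two evaluation splits reported above (a 70-30 train-test split over 100 tasks, and leave-one-out over instruction types) can be sketched as follows. This is not the authors' code; the task IDs and instruction-type names are placeholders:

```python
import random

random.seed(0)

# 70-30 train-test split across the 100 tasks (task names are placeholders).
tasks = [f"task_{i}" for i in range(100)]
random.shuffle(tasks)
train_tasks, test_tasks = tasks[:70], tasks[70:]

# Leave-one-out over the instruction-type dimension: each type is held out
# once for testing while the probe is trained on the rest. The type names
# below are illustrative, not the paper's actual categories.
instruction_types = ["length", "keyword", "format", "case"]
loo_splits = [
    (held_out, [t for t in instruction_types if t != held_out])
    for held_out in instruction_types
]
```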
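
The probe training recipe in the table (a linear model on internal representations, 1000 epochs of AdamW at lr 0.001 with weight decay 0.1) can be sketched as a minimal logistic-regression probe. The data below is synthetic and the AdamW update is written out by hand for self-containment; dimensions and labels are assumptions, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 16                              # synthetic "embeddings" and dim
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
y = (X @ true_w > 0).astype(float)          # binary: instruction followed or not

w, b = np.zeros(d), 0.0
m_w, v_w = np.zeros(d), np.zeros(d)
m_b = v_b = 0.0
lr, wd, b1, b2, eps = 1e-3, 0.1, 0.9, 0.999, 1e-8   # paper's lr and weight decay

for t in range(1, 1001):                    # 1000 epochs, full-batch
    p = 1.0 / (1.0 + np.exp(-(X @ w + b))) # sigmoid probability
    g_w = X.T @ (p - y) / n                 # logistic-loss gradients
    g_b = float(np.mean(p - y))
    # AdamW moment updates
    m_w = b1 * m_w + (1 - b1) * g_w
    v_w = b2 * v_w + (1 - b2) * g_w**2
    m_b = b1 * m_b + (1 - b1) * g_b
    v_b = b2 * v_b + (1 - b2) * g_b**2
    mh_w, vh_w = m_w / (1 - b1**t), v_w / (1 - b2**t)
    mh_b, vh_b = m_b / (1 - b1**t), v_b / (1 - b2**t)
    # decoupled weight decay acts directly on the weights, not the gradient
    w -= lr * (mh_w / (np.sqrt(vh_w) + eps) + wd * w)
    b -= lr * mh_b / (np.sqrt(vh_b) + eps)

acc = float(np.mean(((X @ w + b) > 0) == (y > 0.5)))
```

The learned weight vector `w` plays the role of the instruction-following direction along which representations could then be shifted by a factor α.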