Do LLMs "know" internally when they follow instructions?

Authors: Juyeon Heo, Christina Heinze-Deml, Oussama Elachqar, Kwan Ho Ryan Chan, Shirley Ren, Andrew Miller, Udhyakumar Nallasamy, Jaya Narain

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our analysis identifies a direction in the input embedding space, termed the instruction-following dimension, that predicts whether a response will comply with a given instruction. We find that this dimension generalizes well across unseen tasks but not across unseen instruction types. We demonstrate that modifying representations along this dimension improves instruction-following success rates compared to random changes, without compromising response quality.
Researcher Affiliation Collaboration Juyeon Heo (1, *), Christina Heinze-Deml (2), Oussama Elachqar (2), Kwan Ho Ryan Chan (3, *), Shirley Ren (2), Udhay Nallasamy (2), Andy Miller (2), Jaya Narain (2); 1 University of Cambridge, 2 Apple, 3 University of Pennsylvania
Pseudocode No The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code Yes Code and data are available at https://github.com/apple/ml-internal-llms-instruction-following
Open Datasets Yes To objectively evaluate LLMs with simple and verifiable instructions, we select IFEval (Zhou et al., 2023) as our base dataset. The IFEval-simple data is available at https://github.com/apple/ml-internal-llms-instruction-following.
Dataset Splits Yes To evaluate task generalization, we split the data along the task dimension, using a 70-30 train-test split across the 100 tasks. To evaluate instruction-type generalization, we applied a leave-one-out approach over the instruction-type dimension.
Hardware Specification No The paper mentions analyzing models like LLaMA-2-7B-chat, LLaMA-2-13B-chat, Mistral-7B-Instruct-v0.3, and Phi-3-mini-128k-instruct but does not specify the hardware (e.g., specific GPU or CPU models) used to run the experiments.
Software Dependencies No The paper mentions 'AdamW' as an optimizer and 'GPT-4' for quality assessment, but it does not specify any software libraries or frameworks with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup Yes A simple linear model was trained to predict the instruction-following success outcome, optimized for 1000 epochs with AdamW, a 0.001 learning rate, and 0.1 weight decay. The selected α values were: 0.3 for Llama-2-chat-13b and Llama-2-chat-7b, 0.1 for Phi-3, and 0.15 for Mistral-7B.
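The probe setup above can be sketched as follows. The stated hyperparameters (1000 epochs, AdamW, lr 0.001, weight decay 0.1) are taken from the report; everything else is an assumption: the activations and success labels are synthetic placeholders, the dimensions are shrunk for illustration, and the logistic-loss objective is a guess at the unspecified training target.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n = 64, 256                    # toy sizes (real hidden dims are ~4096)
X = rng.standard_normal((n, d_model))   # placeholder hidden-state features
true_dir = rng.standard_normal(d_model)
y = (X @ true_dir > 0).astype(float)    # placeholder success labels

# Linear probe trained with AdamW-style decoupled weight decay,
# matching the reported hyperparameters: 1000 epochs, lr=1e-3, wd=0.1.
w = np.zeros(d_model)
m, v = np.zeros(d_model), np.zeros(d_model)
b1, b2, eps, lr, wd = 0.9, 0.999, 1e-8, 1e-3, 0.1
for t in range(1, 1001):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))   # sigmoid predictions
    grad = X.T @ (p - y) / n             # logistic-loss gradient
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    mhat, vhat = m / (1 - b1**t), v / (1 - b2**t)
    w -= lr * (mhat / (np.sqrt(vhat) + eps) + wd * w)  # decoupled decay step

# The probe's weight vector is the candidate "instruction-following dimension".
direction = w / np.linalg.norm(w)
```

The normalized weight vector is what the report's generalization experiments would then evaluate on held-out tasks and instruction types.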
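The representation modification described above can be sketched with the reported per-model α values. The exact intervention form is not specified in this summary, so the `steer` helper below assumes a simple additive shift along the unit-normalized direction; `h` and `d` are hypothetical placeholders.

```python
import numpy as np

# Per-model steering strengths as reported in the setup.
ALPHA = {
    "Llama-2-chat-7b": 0.3,
    "Llama-2-chat-13b": 0.3,
    "Phi-3": 0.1,
    "Mistral-7B": 0.15,
}

def steer(h: np.ndarray, d: np.ndarray, alpha: float) -> np.ndarray:
    """Shift representation h by alpha along the unit-normalized direction d
    (an assumed additive intervention, not the paper's exact procedure)."""
    d_unit = d / np.linalg.norm(d)
    return h + alpha * d_unit

rng = np.random.default_rng(0)
h = rng.standard_normal(64)   # placeholder hidden representation
d = rng.standard_normal(64)   # placeholder instruction-following direction
h_steered = steer(h, d, ALPHA["Mistral-7B"])
```

Under this additive form, the edit moves the representation by exactly α in the direction of d, which is the comparison baseline the report contrasts with random changes of the same magnitude.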