Improving Reasoning Performance in Large Language Models via Representation Engineering
Authors: Bertram Højer, Oliver Jarvis, Stefan Heinrich
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply control vectors to Mistral-7B-Instruct and a range of Pythia models on an inductive, a deductive, and a mathematical reasoning task. We show that an LLM can, to a certain degree, be controlled to improve its perceived reasoning ability by modulating activations. The intervention is dependent upon the ability to reliably extract the model's typical state when correctly solving a task. Our results suggest that reasoning performance can be modulated in the same manner as other information-processing tasks performed by LLMs and demonstrate that we are capable of improving performance on specific tasks via a simple intervention on the residual stream with no additional training. |
| Researcher Affiliation | Academia | Bertram Højer, Oliver Jarvis, Stefan Heinrich, Department of Computer Science, IT University of Copenhagen, Denmark EMAIL |
| Pseudocode | No | The paper does not contain explicit pseudocode or algorithm blocks. It provides mathematical equations for defining control vectors and transformer operations (Equation 1, 2, 3) but these are not formatted as pseudocode. |
| Open Source Code | Yes | We publish the code for deriving control vectors and analyzing model representations (code: https://github.com/bertramhojer/improve-reasoning-iclr-2025). The method allows us to improve performance on reasoning benchmarks and assess how control vectors influence the final logit distribution of a model via metrics such as KL divergence and entropy. We apply control vectors to Mistral-7B-Instruct and a range of Pythia models on an inductive, a deductive, and a mathematical reasoning task. |
| Open Datasets | Yes | bAbI comprises various reasoning tasks, one of which relates to deductive reasoning and contains 2,000 examples. GSM8K consists of high-quality grade school math problems on which relatively capable LLMs still struggle. We do not perform any additional pre-processing and have downloaded the data directly from https://huggingface.co/datasets/Muennighoff/babi. |
| Dataset Splits | Yes | For each dataset we create train and test splits with stratified labels. We then derive the control vector based on model representations when it generates outputs on examples from the train split and test model performance with a control vector applied on the test set. We used an 80/20 train-test split to train and evaluate the performance on control vectors. |
| Hardware Specification | No | The paper mentions using pre-trained models (Pythia-1.4B, Pythia-2.8B, Mistral-7B-Instruct) but does not specify the hardware used to run the experiments or extract activations. |
| Software Dependencies | No | Our framework is built as a wrapper around PyTorch, enabling easy extraction of hidden dimension representations and application of control vectors. The models described in section 3.3 were loaded using the Hugging Face API and details on model version are described there. |
| Experiment Setup | Yes | We generally assess the impact of the intervention at α ∈ [−1, 1] at increments of 0.1, but look at a range of [−3, 3] for Mistral-7B-Instruct. When deriving control vectors we get a control vector for each layer, previous work however indicates that only applying the vectors to the middle layer is enough to induce strong changes to model outputs (Templeton et al., 2024). |
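The rows above describe the core operation: derive a per-layer control vector from hidden-state representations on the train split, then add it to the residual stream scaled by a coefficient α. The sketch below illustrates the general idea with a mean-difference steering vector on toy activations; the function names, the mean-difference construction, and the array shapes are illustrative assumptions, not the paper's exact derivation (which works on activations extracted from a wrapped PyTorch model).

```python
import numpy as np


def derive_control_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Hypothetical control vector: difference of mean activations between
    examples the model handles correctly (pos) and incorrectly (neg).
    Shapes: (n_examples, hidden_dim) -> (hidden_dim,)."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)


def apply_control_vector(hidden: np.ndarray, v: np.ndarray, alpha: float) -> np.ndarray:
    """Add the scaled control vector to residual-stream activations.
    alpha = 0 leaves activations unchanged; the review notes alpha is swept
    over [-1, 1] (and [-3, 3] for Mistral-7B-Instruct)."""
    return hidden + alpha * v


# Toy stand-in for extracted middle-layer activations: 10 examples, hidden size 8.
rng = np.random.default_rng(0)
pos = rng.normal(0.5, 1.0, size=(10, 8))   # activations on correctly solved examples
neg = rng.normal(-0.5, 1.0, size=(10, 8))  # activations on incorrectly solved examples

v = derive_control_vector(pos, neg)
steered = apply_control_vector(neg, v, alpha=1.0)
```

In practice the intervention is applied at a single (middle) layer during generation rather than to a batch of cached activations, e.g. via a PyTorch forward hook on that layer.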