Controllable Context Sensitivity and the Knob Behind It

Authors: Julian Minder, Kevin Du, Niklas Stoehr, Giovanni Monea, Chris Wendler, Robert West, Ryan Cotterell

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "When fine-tuned on this task, instruct versions of Llama-3.1, Mistral-v0.3, and Gemma-2 can solve it with high accuracy (85-95%). Analyzing these high-performing models, we narrow down which layers may be important to context sensitivity using a novel linear time algorithm."
Researcher Affiliation | Academia | ETH Zürich, EPFL, Cornell University
Pseudocode | Yes | "We provide Python-esque pseudocode for our search algorithm in App. A.1. Listing 1: Search Algorithm."
Open Source Code | Yes | "We provide code to reproduce all datasets, experiments, and analysis at https://github.com/kdu4108/context-vs-prior-finetuning."
Open Datasets | Yes | "Following the task formulation in Section 3.1, we construct intent-augmented datasets, CCS-BF, CCS-MH, and CCS-AR, based on the query-context pairs in BASEFAKEPEDIA, MULTIHOPFAKEPEDIA (Monea et al., 2024), and ARITHMETIC."
Dataset Splits | Yes | "Let S_trn ⊆ Q × C and S_tst ⊆ Q × C be disjoint training and testing sets of query-context pairs. Models are trained on F(q, c, pri) ↦ a(q, ε) and F(q, c, ctx) ↦ a(q, c) for (q, c) ∈ S_trn, where F denotes the prompt formed by concatenation. ... Training set size: 2048 examples."
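The intent-augmented training-pair construction described in that quote can be sketched as follows. This is an illustrative reconstruction, not the paper's released code: the helper names (`build_prompt`, `make_training_pairs`) and the intent-token strings are assumptions, and only the pairing logic (one prior-intent and one context-intent example per query-context pair) is taken from the text.

```python
# Hypothetical sketch of building F(q, c, intent) -> answer training pairs.
# Token strings and function names are illustrative assumptions.
PRI, CTX = "<prior>", "<context>"

def build_prompt(query, context, intent):
    """F(q, c, intent): one prompt concatenating context, intent token, query."""
    return f"{context} {intent} {query}"

def make_training_pairs(query, context, prior_answer, context_answer):
    """For each (q, c) in S_trn, emit a prior-intent and a context-intent example."""
    return [
        (build_prompt(query, context, PRI), prior_answer),    # a(q, ε): parametric answer
        (build_prompt(query, context, CTX), context_answer),  # a(q, c): context answer
    ]

pairs = make_training_pairs(
    query="What is the capital of France?",
    context="The capital of France is Rome.",
    prior_answer="Paris",
    context_answer="Rome",
)
```

Under this construction, the same query-context pair yields two supervised examples whose targets disagree, so the intent token alone must determine whether the model follows context or prior.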
Hardware Specification | No | The paper does not provide specific hardware details, such as GPU models, CPU models, or cloud computing specifications, used for running the experiments.
Software Dependencies | No | "We build on pyvene (Wu et al., 2024) to train the projection. ... apply PyTorch's orthogonal parametrization to enforce orthonormal columns in A." The paper mentions `pyvene` and `PyTorch` but does not provide specific version numbers for these or other software dependencies.
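The constraint that PyTorch's orthogonal parametrization enforces on the projection matrix A is that its columns are orthonormal, i.e. AᵀA = I. A minimal NumPy stand-in (not the paper's PyTorch code) can illustrate the property via a QR decomposition; the matrix shape below is an arbitrary choice for the example:

```python
import numpy as np

# NumPy stand-in for an orthonormal-columns constraint: a QR decomposition
# of an unconstrained matrix yields A with A.T @ A equal to the identity.
rng = np.random.default_rng(0)
raw = rng.standard_normal((16, 4))   # unconstrained 16x4 parameter matrix
A, _ = np.linalg.qr(raw)             # A: 16x4 with orthonormal columns

gram = A.T @ A                       # 4x4 Gram matrix; should be identity
assert np.allclose(gram, np.eye(4), atol=1e-8)
```

In PyTorch the same constraint is maintained during training by `torch.nn.utils.parametrizations.orthogonal`, which reparametrizes the weight so every optimizer step stays on the manifold of orthonormal-column matrices.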
Experiment Setup | Yes | "To fine-tune models in the CCS-BF task, we use QLoRA with the following hyperparameters: Effective batch size (after gradient accumulation): 16; Optimizer: AdamW (8-bit); Learning rate: 2e-4; QLoRA target modules: attention head projection matrices in all layers; Training set size: 2048 examples."
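The reported hyperparameters can be restated as a small config sketch. The field names are illustrative, and the per-device batch size and accumulation steps below are one assumed factorization of the reported effective batch size of 16; only the effective batch size, optimizer, learning rate, and training set size come from the quote above.

```python
# Hypothetical config restating the reported QLoRA fine-tuning setup.
# per_device_batch_size and gradient_accumulation_steps are assumptions;
# only their product (16) is reported in the paper.
config = {
    "per_device_batch_size": 4,        # assumption
    "gradient_accumulation_steps": 4,  # assumption
    "optimizer": "AdamW (8-bit)",
    "learning_rate": 2e-4,
    "lora_target": "attention head projection matrices (all layers)",
    "train_size": 2048,
}

effective_batch = (config["per_device_batch_size"]
                   * config["gradient_accumulation_steps"])
assert effective_batch == 16           # matches the reported effective batch size

steps_per_epoch = config["train_size"] // effective_batch  # 2048 / 16 = 128
```

With 2048 training examples and an effective batch of 16, one epoch is 128 optimizer steps, which gives a sense of the small scale of this fine-tuning task.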