Controllable Context Sensitivity and the Knob Behind It
Authors: Julian Minder, Kevin Du, Niklas Stoehr, Giovanni Monea, Chris Wendler, Robert West, Ryan Cotterell
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | When fine-tuned on this task, instruct versions of Llama-3.1, Mistral-v0.3, and Gemma-2 can solve it with high accuracy (85–95%). Analyzing these high-performing models, we narrow down which layers may be important to context sensitivity using a novel linear-time algorithm. |
| Researcher Affiliation | Academia | ETH Zürich, EPFL, Cornell University |
| Pseudocode | Yes | We provide Python-esque pseudocode for our search algorithm in App. A.1. Listing 1: Search Algorithm. |
| Open Source Code | Yes | We provide code to reproduce all datasets, experiments, and analysis at https://github.com/kdu4108/context-vs-prior-finetuning. |
| Open Datasets | Yes | Following the task formulation in §3.1, we construct intent-augmented datasets, CCS-BF, CCS-MH, and CCS-AR, based on the query-context pairs in BASEFAKEPEDIA, MULTIHOPFAKEPEDIA (Monea et al., 2024), and ARITHMETIC. |
| Dataset Splits | Yes | Let S_trn ⊆ Q × C and S_tst ⊆ Q × C be disjoint training and testing sets of query–context pairs. Models are trained on F(q, c, pri) ↦ a(q, ε) and F(q, c, ctx) ↦ a(q, c) for (q, c) ∈ S_trn, where ∘ denotes concatenation. ... Training set size: 2048 examples. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or cloud computing specifications used for running the experiments. |
| Software Dependencies | No | We build on pyvene (Wu et al., 2024) to train the projection. ... apply PyTorch's orthogonal parametrization to enforce orthonormal columns in A. The paper mentions `pyvene` and `PyTorch` but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | To fine-tune models in the CCS-BF task, we use QLoRA with the following hyperparameters: Effective batch size (after gradient accumulation): 16; Optimizer: AdamW (8-bit); Learning rate: 2e-4; QLoRA target modules: attention head projection matrices in all layers; Training set size: 2048 examples. |
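The orthonormal-columns constraint quoted in the Software Dependencies row (the paper uses PyTorch's orthogonal parametrization on a projection matrix A) can be illustrated with a minimal NumPy sketch. This is not the paper's training code; the dimensions and the QR-based construction are illustrative assumptions — it only shows the property being enforced: every column of A is unit-norm and mutually orthogonal.

```python
import numpy as np

def orthonormalize(W: np.ndarray) -> np.ndarray:
    """Map an unconstrained matrix W (d x k, d >= k) to a matrix A
    with orthonormal columns, via a reduced QR decomposition.
    PyTorch's orthogonal parametrization enforces the same property
    (through a different construction)."""
    Q, R = np.linalg.qr(W)  # Q: d x k, columns orthonormal
    # Fix column signs (standard QR convention) so the map is deterministic.
    Q = Q * np.sign(np.diag(R))
    return Q

# Hypothetical dimensions for illustration: project a 16-dim residual
# stream onto a 4-dim subspace spanned by orthonormal directions.
rng = np.random.default_rng(0)
A = orthonormalize(rng.normal(size=(16, 4)))
# A.T @ A is (approximately) the 4x4 identity matrix, so the columns
# of A form an orthonormal basis of the learned subspace.
```

In PyTorch this constraint is applied with `torch.nn.utils.parametrizations.orthogonal` on the module holding A, so the optimizer updates an unconstrained parameter while the exposed weight always has orthonormal columns.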