Controllable Context Sensitivity and the Knob Behind It

Authors: Julian Minder, Kevin Du, Niklas Stoehr, Giovanni Monea, Chris Wendler, Robert West, Ryan Cotterell

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "When fine-tuned on this task, instruct versions of Llama-3.1, Mistral-v0.3, and Gemma-2 can solve it with high accuracy (85-95%). Analyzing these high-performing models, we narrow down which layers may be important to context sensitivity using a novel linear time algorithm."
Researcher Affiliation | Academia | ETH Zürich, EPFL, Cornell University
Pseudocode | Yes | "We provide Python-esque pseudocode for our search algorithm in App. A.1. Listing 1: Search Algorithm."
Open Source Code | Yes | "We provide code to reproduce all datasets, experiments, and analysis at https://github.com/kdu4108/context-vs-prior-finetuning."
Open Datasets | Yes | "Following the task formulation in Section 3.1, we construct intent-augmented datasets, CCS-BF, CCS-MH, and CCS-AR, based on the query-context pairs in BASEFAKEPEDIA, MULTIHOPFAKEPEDIA (Monea et al., 2024), and ARITHMETIC."
Dataset Splits | Yes | "Let S_trn ⊆ Q × C and S_tst ⊆ Q × C be disjoint training and testing sets of query-context pairs. Models are trained on F(q, c, pri) ↦ a(q, ε) and F(q, c, ctx) ↦ a(q, c) for (q, c) ∈ S_trn, where F denotes the prompt formed by concatenation. ... Training set size: 2048 examples."
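The intent-augmented training-pair construction described in that quote can be sketched as follows. This is an illustrative reconstruction, not the paper's released code: the helper names (`build_prompt`, `make_training_pairs`) and the intent-token strings are assumptions, and only the pairing logic (one prior-intent and one context-intent example per query-context pair) is taken from the text.

```python
# Hypothetical sketch of building F(q, c, intent) -> answer training pairs.
# Token strings and function names are illustrative assumptions.
PRI, CTX = "<prior>", "<context>"

def build_prompt(query, context, intent):
    """F(q, c, intent): one prompt concatenating context, intent token, query."""
    return f"{context} {intent} {query}"

def make_training_pairs(query, context, prior_answer, context_answer):
    """For each (q, c) in S_trn, emit a prior-intent and a context-intent example."""
    return [
        (build_prompt(query, context, PRI), prior_answer),    # a(q, ε): parametric answer
        (build_prompt(query, context, CTX), context_answer),  # a(q, c): context answer
    ]

pairs = make_training_pairs(
    query="What is the capital of France?",
    context="The capital of France is Rome.",
    prior_answer="Paris",
    context_answer="Rome",
)
```

Under this construction, the same query-context pair yields two supervised examples whose targets disagree, so the intent token alone must determine whether the model follows context or prior.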
Hardware Specification | No | The paper does not provide specific hardware details, such as GPU models, CPU models, or cloud computing specifications, used for running the experiments.
Software Dependencies | No | "We build on pyvene (Wu et al., 2024) to train the projection. ... apply PyTorch's orthogonal parametrization to enforce orthonormal columns in A." The paper mentions `pyvene` and `PyTorch` but does not provide specific version numbers for these or other software dependencies.
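The constraint that PyTorch's orthogonal parametrization enforces on the projection matrix A is that its columns are orthonormal, i.e. AᵀA = I. A minimal NumPy stand-in (not the paper's PyTorch code) can illustrate the property via a QR decomposition; the matrix shape below is an arbitrary choice for the example:

```python
import numpy as np

# NumPy stand-in for an orthonormal-columns constraint: a QR decomposition
# of an unconstrained matrix yields A with A.T @ A equal to the identity.
rng = np.random.default_rng(0)
raw = rng.standard_normal((16, 4))   # unconstrained 16x4 parameter matrix
A, _ = np.linalg.qr(raw)             # A: 16x4 with orthonormal columns

gram = A.T @ A                       # 4x4 Gram matrix; should be identity
assert np.allclose(gram, np.eye(4), atol=1e-8)
```

In PyTorch the same constraint is maintained during training by `torch.nn.utils.parametrizations.orthogonal`, which reparametrizes the weight so every optimizer step stays on the manifold of orthonormal-column matrices.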
Experiment Setup | Yes | "To fine-tune models in the CCS-BF task, we use QLoRA with the following hyperparameters: Effective batch size (after gradient accumulation): 16; Optimizer: AdamW (8-bit); Learning rate: 2e-4; QLoRA target modules: attention head projection matrices in all layers; Training set size: 2048 examples."
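The reported hyperparameters can be restated as a small config sketch. The field names are illustrative, and the per-device batch size and accumulation steps below are one assumed factorization of the reported effective batch size of 16; only the effective batch size, optimizer, learning rate, and training set size come from the quote above.

```python
# Hypothetical config restating the reported QLoRA fine-tuning setup.
# per_device_batch_size and gradient_accumulation_steps are assumptions;
# only their product (16) is reported in the paper.
config = {
    "per_device_batch_size": 4,        # assumption
    "gradient_accumulation_steps": 4,  # assumption
    "optimizer": "AdamW (8-bit)",
    "learning_rate": 2e-4,
    "lora_target": "attention head projection matrices (all layers)",
    "train_size": 2048,
}

effective_batch = (config["per_device_batch_size"]
                   * config["gradient_accumulation_steps"])
assert effective_batch == 16           # matches the reported effective batch size

steps_per_epoch = config["train_size"] // effective_batch  # 2048 / 16 = 128
```

With 2048 training examples and an effective batch of 16, one epoch is 128 optimizer steps, which gives a sense of the small scale of this fine-tuning task.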