TuCo: Measuring the Contribution of Fine-Tuning to Individual Responses of LLMs

Authors: Felipe Pinto Coelho Nuti, Tim Franzmeyer, Joao F. Henriques

ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Empirically, we find that one can steer model behavior and performance by up- or down-scaling the fine-tuning component during the forward pass. [...] We empirically validate that TuCo is indeed much lower for pre-training-like inputs from the OpenWebText dataset [...] We then investigate how three prominent jailbreaking techniques affect the Tuning Contribution. [...] We compute the Tuning Contribution as described in Algorithm 1. We explain all experiments in more detail in the Appendix and make all code available publicly.
Researcher Affiliation Academia 1University of Oxford. Correspondence to: Felipe Nuti <EMAIL>.
Pseudocode Yes Algorithm 1 Computation of Tuning Contribution (TuCo)
Input: pre-trained model T^PT_φ, fine-tuned model T^FT_Θ, prompt s
  x_0 ← Embed(Tokenizer(s))  {Tokenize and embed prompt}
  IFTC, IPTC ← 0  {Initialize cumulative contributions}
  for l = 0 to L − 1 do
    PTC_l ← f^PT_φ(x_l, l)  {Compute PTC for layer l}
    FTC_l ← f^FT_Θ(x_l, l) − PTC_l  {Compute FTC for layer l}
    x_{l+1} ← x_l + PTC_l + FTC_l  {Update x for next layer}
    IFTC ← IFTC + FTC_l[−1]  {Accumulate last-token FTC}
    IPTC ← IPTC + PTC_l[−1]  {Accumulate last-token PTC}
  end for
  TuCo ← ‖IFTC‖ / (‖IPTC‖ + ‖IFTC‖)  {Compute TuCo}
Return: TuCo
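A minimal sketch of Algorithm 1 in numpy. The layer interface is assumed, not taken from the released code: `pt_layers` and `ft_layers` are per-layer functions returning the residual update each block adds to hidden states of shape (seq_len, d_model). Whether norms are taken per layer or on the accumulated vectors is an implementation detail of the paper; this sketch accumulates vectors and takes norms of the totals, matching the ratio form of TuCo.

```python
import numpy as np

def tuco(pt_layers, ft_layers, x0):
    """Sketch of TuCo: ratio of the fine-tuning contribution to the total
    contribution at the last token position, accumulated over all layers."""
    x = x0
    iftc = np.zeros_like(x0[-1])  # cumulative last-token FTC
    iptc = np.zeros_like(x0[-1])  # cumulative last-token PTC
    for f_pt, f_ft in zip(pt_layers, ft_layers):
        ptc = f_pt(x)         # pre-training component of the layer update
        ftc = f_ft(x) - ptc   # fine-tuning component (the difference)
        x = x + ptc + ftc     # standard residual update of the FT model
        iftc = iftc + ftc[-1]
        iptc = iptc + ptc[-1]
    return np.linalg.norm(iftc) / (np.linalg.norm(iptc) + np.linalg.norm(iftc))
```

With identical layer updates for both models the ratio degenerates to 0 (no fine-tuning contribution); TuCo always lies in [0, 1].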
Open Source Code Yes 2Code is available at http://github.com/FelipeNuti/tuning-contribution.
Open Datasets Yes Empirically, we also find that scaling the magnitude of the fine-tuning component controls model behaviors and capabilities. Specifically, scaling of the FTC results in as much as 5% test-set performance improvements for tasks of the MMLU benchmark (Hendrycks et al., 2020). [...] We empirically validate that TuCo is indeed much lower for pre-training-like inputs from the OpenWebText dataset (Gokaslan & Cohen, 2019) than for chat-like inputs from a dataset designed for harmless and helpful model behavior (Bai et al., 2022a; Ganguli et al., 2022). [...] We construct a dataset consisting of the harmful instructions from the AdvBench benchmark (Zou et al., 2023b) in English, Japanese, Hungarian, Swahili and Malayalam.
Dataset Splits Yes We use 5-fold cross-validation, and report the change in out-of-sample average accuracy CV(D), averaged across folds of a dataset D. [...] To evaluate how much we can increase model accuracy by choosing α appropriately, we first evenly divide D into K = 5 folds D1, …, DK.
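The quoted cross-validation procedure can be sketched as follows. `accuracy_fn` and the example set are hypothetical stand-ins (not from the paper's code) for evaluating the FTC-scaled model on a task at a given α:

```python
import numpy as np

def cv_accuracy_gain(examples, accuracy_fn, alphas, k=5, seed=0):
    """Pick alpha on k-1 folds, measure the out-of-sample accuracy change
    (vs. alpha = 1, the unmodified fine-tuned model) on the held-out fold,
    and average the gain across folds -- the CV(D) quantity above."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(examples)), k)
    gains = []
    for i in range(k):
        held_out = [examples[j] for j in folds[i]]
        held_in = [examples[j] for f in folds[:i] + folds[i + 1:] for j in f]
        # select alpha on the held-in folds...
        alpha_star = max(alphas, key=lambda a: accuracy_fn(a, held_in))
        # ...and score the out-of-sample change on the held-out fold
        gains.append(accuracy_fn(alpha_star, held_out) - accuracy_fn(1.0, held_out))
    return float(np.mean(gains))
```

Reporting the change relative to α = 1 isolates the benefit of the intervention from the base accuracy of the fine-tuned model.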
Hardware Specification No The paper mentions 'open-source models of up to 13B parameters' and 'GPU memory and running time constraints' in relation to MMLU tasks, but does not provide specific details on the GPU models, CPU models, or other hardware specifications used for their experiments.
Software Dependencies No The paper does not explicitly list software dependencies with specific version numbers.
Experiment Setup Yes We modulate the magnitude of the fine-tuning component FTC throughout the forward pass, and study to what extent model performance and behavior can be controlled via this modulation. [...] We evaluate the impact of scaling α between 0.75 and 1.25 on model outputs [...] We next optimize accuracy for each task and behavior using a grid search for α ∈ {0.75, 0.9, 0.95, 1.0, 1.05, 1.1, 1.25}.
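The α-scaling intervention described above can be sketched under the same assumed per-layer interface as the TuCo decomposition (function names are illustrative, not from the released code): at every layer the fine-tuning component of the update is multiplied by α before re-entering the residual stream.

```python
import numpy as np

def forward_with_scaled_ftc(pt_layers, ft_layers, x0, alpha=1.0):
    """Forward pass with the fine-tuning component scaled by alpha.
    alpha = 1 recovers the fine-tuned model's forward pass;
    alpha = 0 recovers the pre-trained model's forward pass."""
    x = x0
    for f_pt, f_ft in zip(pt_layers, ft_layers):
        ptc = f_pt(x)         # pre-training component of the layer update
        ftc = f_ft(x) - ptc   # fine-tuning component
        x = x + ptc + alpha * ftc
    return x
```

Sweeping α over the grid quoted above then amounts to rerunning this forward pass per candidate value and scoring the outputs.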