Efficiently Learning at Test-Time: Active Fine-Tuning of LLMs
Authors: Jonas Hübotter, Sascha Bongni, Ido Hakimi, Andreas Krause
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We focus our evaluations on fine-tuning at test-time for prompt-specific language modeling on the Pile dataset, and show that SIFT consistently outperforms Nearest Neighbor retrieval, with minimal computational overhead. Moreover, we show that our uncertainty estimates can predict the performance gain of test-time fine-tuning, and use this to develop an adaptive algorithm that invests test-time compute proportional to realized performance gains. |
| Researcher Affiliation | Academia | Jonas Hübotter, Sascha Bongni, Ido Hakimi, Andreas Krause, ETH Zürich, Switzerland |
| Pseudocode | Yes | Algorithm 1 SIFT(λ); Algorithm 2 SIFT-FAST(λ); Algorithm 3 SIFT-FAST(λ): RECOMPUTE; Algorithm 4 SIFT-FAST(λ): UPDATESTATE |
| Open Source Code | Yes | We provide the activeft (Active Fine-Tuning) library which can be used as a drop-in replacement for Nearest Neighbor retrieval. |
| Open Datasets | Yes | We use the Pile dataset (Gao et al., 2020) for evaluation, restricting our use to data which is obtained and used in compliance with the terms of service of the data host. |
| Dataset Splits | Yes | We use the Pile training set containing 210M sequences of total size 1.3TB as data space for data selection, and we evaluate on the Pile test set. We evaluate on 1% of the test set (0.1% with Phi-3), corresponding to 1,812 sequences. |
| Hardware Specification | Yes | We report results with an NVIDIA RTX 4090 GPU in Figure 4. Results are with an NVIDIA GH200. |
| Software Dependencies | Yes | We use the Adam optimizer (Kingma & Ba, 2014) with ϵ-value 1e-8. We use the default learning rate 5e-5 of the transformers library (Wolf et al., 2020) unless noted otherwise. We use the standard implementation of the lm-evaluation-harness library (Gao et al., 2024) for computing the bits per byte. |
| Experiment Setup | Yes | We fine-tune a pre-trained LLM for a single gradient step each on N = 50 selected data points in the order that they are selected, most to least relevant. We use the default learning rate 5e-5 of the transformers library (Wolf et al., 2020) unless noted otherwise. We use LoRAs with rank 64, output scaling 16, without dropout and bias. When fine-tuning with LoRA, we use the learning rate 5e-4. We provide an overview of all hyperparameters of test-time fine-tuning in Table 9. |
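The table above notes that SIFT is a data-selection method meant as a drop-in replacement for Nearest Neighbor retrieval. As a rough illustration of the idea (not the paper's `activeft` implementation), the sketch below greedily selects candidates that most reduce the posterior variance of the prompt under kernel ridge regression with a linear kernel and regularization λ; the function name `sift_select` and all embeddings are hypothetical.

```python
import numpy as np

def sift_select(prompt_emb, cand_embs, n_select, lam=0.1):
    """Greedy uncertainty-reduction selection (sketch, not activeft).

    At each step, pick the candidate whose inclusion most reduces the
    posterior variance of the prompt embedding under linear-kernel ridge
    regression with regularization lam.
    """
    selected = []
    remaining = list(range(len(cand_embs)))
    K = cand_embs @ cand_embs.T      # candidate-candidate kernel matrix
    k_star = cand_embs @ prompt_emb  # candidate-prompt kernel vector
    k_ss = prompt_emb @ prompt_emb   # prompt self-kernel
    for _ in range(n_select):
        best_i, best_var = None, np.inf
        for i in remaining:
            S = selected + [i]
            K_SS = K[np.ix_(S, S)] + lam * np.eye(len(S))
            k_S = k_star[S]
            # Posterior variance of the prompt given selected points S
            var = k_ss - k_S @ np.linalg.solve(K_SS, k_S)
            if var < best_var:
                best_i, best_var = i, var
        selected.append(best_i)
        remaining.remove(best_i)
    return selected
```

Unlike plain Nearest Neighbor retrieval, such a greedy criterion penalizes redundant picks: a candidate nearly identical to an already selected one barely lowers the remaining variance. The paper's uncertainty estimates similarly support its adaptive compute allocation.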
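The reported fine-tuning hyperparameters (LoRA rank 64, output scaling 16, no dropout or bias, learning rate 5e-4, Adam with ϵ = 1e-8, one gradient step per selected point) can be expressed as a configuration sketch using the `peft` and `transformers` libraries; the choice of `"gpt2"` as the base model is an assumption for illustration, not necessarily the paper's checkpoint.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Base model is a placeholder; the paper evaluates several LLMs (e.g. Phi-3).
model = AutoModelForCausalLM.from_pretrained("gpt2")

# LoRA hyperparameters as reported: rank 64, output scaling 16,
# without dropout and bias.
lora = LoraConfig(r=64, lora_alpha=16, lora_dropout=0.0, bias="none")
model = get_peft_model(model, lora)

# Adam with eps 1e-8; learning rate 5e-4 when fine-tuning with LoRA.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, eps=1e-8)

# Test-time fine-tuning loop as described: a single gradient step on each
# of the N = 50 selected points, ordered most to least relevant.
# for batch in selected_batches:
#     loss = model(**batch).loss
#     loss.backward()
#     optimizer.step()
#     optimizer.zero_grad()
```

This is a configuration sketch under the stated assumptions; the full hyperparameter list is in the paper's Table 9.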