Efficiently Learning at Test-Time: Active Fine-Tuning of LLMs
Authors: Jonas Hübotter, Sascha Bongni, Ido Hakimi, Andreas Krause
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We focus our evaluations on fine-tuning at test-time for prompt-specific language modeling on the Pile dataset, and show that SIFT consistently outperforms Nearest Neighbor retrieval, with minimal computational overhead. Moreover, we show that our uncertainty estimates can predict the performance gain of test-time fine-tuning, and use this to develop an adaptive algorithm that invests test-time compute proportional to realized performance gains. |
| Researcher Affiliation | Academia | Jonas Hübotter, Sascha Bongni, Ido Hakimi, Andreas Krause, ETH Zürich, Switzerland |
| Pseudocode | Yes | Algorithm 1 SIFT(λ); Algorithm 2 SIFT-FAST(λ); Algorithm 3 SIFT-FAST(λ): RECOMPUTE; Algorithm 4 SIFT-FAST(λ): UPDATESTATE |
| Open Source Code | Yes | We provide the activeft (Active Fine-Tuning) library which can be used as a drop-in replacement for Nearest Neighbor retrieval. |
| Open Datasets | Yes | We use the Pile dataset (Gao et al., 2020) for evaluation, restricting our use to data which is obtained and used in compliance with the terms of service of the data host. |
| Dataset Splits | Yes | We use the Pile training set containing 210M sequences of total size 1.3TB as data space for data selection, and we evaluate on the Pile test set. We evaluate on 1% of the test set (0.1% with Phi-3), corresponding to 1,812 sequences. |
| Hardware Specification | Yes | We report results with an NVIDIA RTX 4090 GPU in Figure 4. Results are with an NVIDIA GH200. |
| Software Dependencies | Yes | We use the Adam optimizer (Kingma & Ba, 2014) with ϵ-value 1e-8. We use the default learning rate 5e-5 of the transformers library (Wolf et al., 2020) unless noted otherwise. We use the standard implementation of the lm-evaluation-harness library (Gao et al., 2024) for computing the bits per byte. |
| Experiment Setup | Yes | We fine-tune a pre-trained LLM for a single gradient step each on N = 50 selected data points in the order that they are selected, most to least relevant. We use the default learning rate 5e-5 of the transformers library (Wolf et al., 2020) unless noted otherwise. We use LoRAs with rank 64, output scaling 16, without dropout and bias. When fine-tuning with LoRA, we use the learning rate 5e-4. We provide an overview of all hyperparameters of test-time fine-tuning in Table 9. |
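The table above notes that SIFT is a data-selection method meant as a drop-in replacement for Nearest Neighbor retrieval. As a rough illustration of the idea (not the paper's `activeft` implementation), the sketch below greedily selects candidates that most reduce the posterior variance of the prompt under kernel ridge regression with a linear kernel and regularization λ; the function name `sift_select` and all embeddings are hypothetical.

```python
import numpy as np

def sift_select(prompt_emb, cand_embs, n_select, lam=0.1):
    """Greedy uncertainty-reduction selection (sketch, not activeft).

    At each step, pick the candidate whose inclusion most reduces the
    posterior variance of the prompt embedding under linear-kernel ridge
    regression with regularization lam.
    """
    selected = []
    remaining = list(range(len(cand_embs)))
    K = cand_embs @ cand_embs.T      # candidate-candidate kernel matrix
    k_star = cand_embs @ prompt_emb  # candidate-prompt kernel vector
    k_ss = prompt_emb @ prompt_emb   # prompt self-kernel
    for _ in range(n_select):
        best_i, best_var = None, np.inf
        for i in remaining:
            S = selected + [i]
            K_SS = K[np.ix_(S, S)] + lam * np.eye(len(S))
            k_S = k_star[S]
            # Posterior variance of the prompt given selected points S
            var = k_ss - k_S @ np.linalg.solve(K_SS, k_S)
            if var < best_var:
                best_i, best_var = i, var
        selected.append(best_i)
        remaining.remove(best_i)
    return selected
```

Unlike plain Nearest Neighbor retrieval, such a greedy criterion penalizes redundant picks: a candidate nearly identical to an already selected one barely lowers the remaining variance. The paper's uncertainty estimates similarly support its adaptive compute allocation.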
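The reported fine-tuning hyperparameters (LoRA rank 64, output scaling 16, no dropout or bias, learning rate 5e-4, Adam with ϵ = 1e-8, one gradient step per selected point) can be expressed as a configuration sketch using the `peft` and `transformers` libraries; the choice of `"gpt2"` as the base model is an assumption for illustration, not necessarily the paper's checkpoint.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Base model is a placeholder; the paper evaluates several LLMs (e.g. Phi-3).
model = AutoModelForCausalLM.from_pretrained("gpt2")

# LoRA hyperparameters as reported: rank 64, output scaling 16,
# without dropout and bias.
lora = LoraConfig(r=64, lora_alpha=16, lora_dropout=0.0, bias="none")
model = get_peft_model(model, lora)

# Adam with eps 1e-8; learning rate 5e-4 when fine-tuning with LoRA.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, eps=1e-8)

# Test-time fine-tuning loop as described: a single gradient step on each
# of the N = 50 selected points, ordered most to least relevant.
# for batch in selected_batches:
#     loss = model(**batch).loss
#     loss.backward()
#     optimizer.step()
#     optimizer.zero_grad()
```

This is a configuration sketch under the stated assumptions; the full hyperparameter list is in the paper's Table 9.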