Test-Time Training Provably Improves Transformers as In-context Learners
Authors: Halil Alperen Gozeten, Muhammed Emrullah Ildiz, Xuechen Zhang, Mahdi Soltanolkotabi, Marco Mondelli, Samet Oymak
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | As our empirical contribution, we study the benefits of TTT for TabPFN, a tabular foundation model. In line with our theory, we demonstrate that TTT significantly reduces the required sample size for tabular classification (3 to 5 times fewer), unlocking substantial inference efficiency with a negligible training cost. |
| Researcher Affiliation | Academia | 1: University of Michigan, Ann Arbor; 2: University of Southern California; 3: Institute of Science and Technology Austria. |
| Pseudocode | No | The following proposition (proved in Appendix A) characterizes the single-step GD TTT update. Proposition 3.1. Consider the linear attention model with parameters $W \in \mathbb{R}^{d \times d}$. Suppose the test-time training loss function is defined as in (2) and define $u_{\text{context}} := X_{\text{context}}^\top y_{\text{context}} \in \mathbb{R}^d$. Then, for any step size $\eta > 0$, the new parameter $W_{\text{TT}}$ after one gradient-descent step from $W$ is given by the rank-1 update $W_{\text{TT}} = W + 2\eta\, X_{\text{train}}^\top \left(y_{\text{train}} - X_{\text{train}} W u_{\text{context}}\right) u_{\text{context}}^\top$. |
| Open Source Code | No | No explicit statement about providing source code for their methodology is found, nor is a link to a code repository provided. |
| Open Datasets | Yes | Specifically, we evaluate the TabPFN v2 model on The Tremendous TabLib Trawl (T4) dataset (Gardner et al., 2024) for a more comprehensive evaluation, which is a large-scale high-quality collection of tabular benchmarks. |
| Dataset Splits | Yes | Following the official TabPFN v2 implementation and our theoretical setup, we select the datasets containing at least 1,250 samples (with 1,000 for training, using an 80/20 split). |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory, or cloud instance types) are provided for running the experiments. The paper only mentions using the GPT-2 architecture and the TabPFN v2 model. |
| Software Dependencies | No | The paper mentions "TabPFN v2 (Hollmann et al., 2025)" and "GPT-2 architecture (Radford et al., 2019)" but does not provide specific version numbers for these or other software dependencies, such as programming languages or libraries. |
| Experiment Setup | Yes | For the TabPFN (blue curve in Figure 3a), we directly load the pre-trained model and vary the context window length during evaluation. In contrast, for TabPFN+TTT (orange curve in Figure 3a), we finetune the model using different context lengths with k = 1000 samples. As the context length n decreases, the samples are divided into 1000/n groups, where each group undergoes 50 training iterations. ... Setting: $d = 60$; $n = 40$; $k$ varying between 64 and 512 in increments of 64; $\sigma^2 = 0.01$; $\Sigma_\beta = \mathrm{diag}(0.1\, I_{25},\, 0.5\, I_{10},\, I_{25})$; $\Sigma_x = I$. We sample $\beta_{\text{TT}}$ from the distribution $\mathcal{N}(0, \Sigma_{\beta_{\text{TT}}})$ where $\Sigma_{\beta_{\text{TT}}} = \mathrm{diag}(I_{25},\, 0.5\, I_{10},\, 0.1\, I_{25})$. |
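The rank-1 update quoted in the Pseudocode row can be sanity-checked numerically. The sketch below assumes the test-time training loss of Eq. (2) is the squared loss $L(W) = \lVert y_{\text{train}} - X_{\text{train}} W u_{\text{context}} \rVert^2$ (an assumption consistent with the stated gradient step; the paper's exact Eq. (2) is not quoted here), and verifies that one gradient-descent step, with the gradient taken by finite differences, coincides with the closed-form rank-1 update of Proposition 3.1. All variable names and dimensions are illustrative.

```python
import numpy as np

# Numerical check of Proposition 3.1 on random data.
# Assumed loss (Eq. (2)): L(W) = || y_train - X_train W u_context ||^2.

rng = np.random.default_rng(0)
d, n_train, n_ctx = 5, 8, 6

X_train = rng.normal(size=(n_train, d))
y_train = rng.normal(size=n_train)
X_ctx = rng.normal(size=(n_ctx, d))
y_ctx = rng.normal(size=n_ctx)

u = X_ctx.T @ y_ctx                  # u_context = X_context^T y_context
W = rng.normal(size=(d, d))
eta = 0.1                            # step size

def loss(W):
    r = y_train - X_train @ W @ u
    return r @ r

# Closed-form rank-1 update from the proposition:
#   W_TT = W + 2*eta * X_train^T (y_train - X_train W u) u^T
r = y_train - X_train @ W @ u
W_tt = W + 2 * eta * np.outer(X_train.T @ r, u)

# One plain gradient-descent step, with the gradient estimated by
# central finite differences over every entry of W.
eps = 1e-6
num_grad = np.zeros_like(W)
for i in range(d):
    for j in range(d):
        E = np.zeros((d, d))
        E[i, j] = eps
        num_grad[i, j] = (loss(W + E) - loss(W - E)) / (2 * eps)
W_gd = W - eta * num_grad

assert np.allclose(W_gd, W_tt, atol=1e-4)    # GD step matches closed form
assert np.linalg.matrix_rank(W_tt - W) == 1  # the update is rank one
```

The second assertion confirms the "rank-1" claim directly: the update is the outer product of $X_{\text{train}}^\top r$ and $u_{\text{context}}$, so it perturbs $W$ only along a single direction pair.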