Test-Time Training Provably Improves Transformers as In-context Learners

Authors: Halil Alperen Gozeten, Muhammed Emrullah Ildiz, Xuechen Zhang, Mahdi Soltanolkotabi, Marco Mondelli, Samet Oymak

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | As our empirical contribution, we study the benefits of TTT for TabPFN, a tabular foundation model. In line with our theory, we demonstrate that TTT significantly reduces the required sample size for tabular classification (3 to 5 times fewer), unlocking substantial inference efficiency with a negligible training cost.
Researcher Affiliation | Academia | 1) University of Michigan, Ann Arbor; 2) University of Southern California; 3) Institute of Science and Technology Austria.
Pseudocode | No | The following proposition (proved in Appendix A) characterizes the single-step GD TTT update. Proposition 3.1. Consider the linear attention model with parameters W ∈ R^(d×d). Suppose the test-time training loss function is defined as in (2) and define u_context := X_context^T y_context ∈ R^d. Then, for any step size η > 0, the new parameter W_TT after one gradient-descent step from W is given by the rank-1 update W_TT = W + 2η X_train^T (y_train − X_train W u_context) u_context^T.
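The rank-1 update in Proposition 3.1 can be checked numerically. The sketch below is an illustration, not the authors' code: it assumes the test-time training loss (2) is the squared loss L(W) = ||y_train − X_train W u_context||², with rows of each X matrix as samples, and verifies that the closed-form update coincides with one gradient-descent step (via a finite-difference gradient, which is exact up to rounding since the loss is quadratic in W).

```python
import numpy as np

# Illustrative shapes and loss; these are assumptions, not the paper's exact setup.
rng = np.random.default_rng(0)
d, n_train, n_ctx, eta = 6, 8, 5, 0.1

X_ctx = rng.normal(size=(n_ctx, d))
y_ctx = rng.normal(size=n_ctx)
X_tr = rng.normal(size=(n_train, d))
y_tr = rng.normal(size=n_train)
W = rng.normal(size=(d, d))

u = X_ctx.T @ y_ctx                  # u_context := X_context^T y_context in R^d
resid = y_tr - X_tr @ W @ u          # residual on the training batch

# Closed-form rank-1 update from Proposition 3.1
W_tt = W + 2 * eta * np.outer(X_tr.T @ resid, u)

# Assumed TTT loss: L(M) = ||y_train - X_train M u_context||^2
def loss(M):
    r = y_tr - X_tr @ M @ u
    return r @ r

# Central finite differences (exact for a quadratic loss, up to rounding)
eps = 1e-6
num_grad = np.zeros_like(W)
for i in range(d):
    for j in range(d):
        E = np.zeros_like(W)
        E[i, j] = eps
        num_grad[i, j] = (loss(W + E) - loss(W - E)) / (2 * eps)

# One explicit GD step matches the closed form, and the update is rank 1
assert np.allclose(W_tt, W - eta * num_grad, atol=1e-6)
assert np.linalg.matrix_rank(W_tt - W) <= 1
```

The update matrix is an outer product of X_train^T(resid) and u_context, which is why it has rank at most 1 regardless of the batch size.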
Open Source Code | No | No explicit statement about providing source code for their methodology is found, nor is a link to a code repository provided.
Open Datasets | Yes | Specifically, we evaluate the TabPFN v2 model on The Tremendous TabLib Trawl (T4) dataset (Gardner et al., 2024), a large-scale, high-quality collection of tabular benchmarks, for a more comprehensive evaluation.
Dataset Splits | Yes | Following the official TabPFN v2 implementation and our theoretical setup, we select the datasets containing at least 1,250 samples (with 1,000 for training, using an 80/20 split)
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory, or cloud instance types) are provided for running the experiments; the paper only mentions using the GPT-2 architecture and the TabPFN v2 model.
Software Dependencies | No | The paper mentions "TabPFN v2 (Hollmann et al., 2025)" and "GPT-2 architecture (Radford et al., 2019)" but does not provide specific version numbers for these or other software dependencies, such as programming languages or libraries.
Experiment Setup | Yes | For the TabPFN (blue curve in Figure 3a), we directly load the pre-trained model and vary the context window length during evaluation. In contrast, for TabPFN+TTT (orange curve in Figure 3a), we finetune the model using different context lengths with k = 1000 samples. As the context length n decreases, the samples are divided into 1000/n groups, where each group undergoes 50 training iterations. ... Setting: d = 60; n = 40; k changing between 64 and 512 in increments of 64; σ² = 0.01; Σ_β = diag(0.1·I_25, 0.5·I_10, I_25); Σ_x = I. We sample β_TT from the distribution N(0, Σ_β,TT) where Σ_β,TT = diag(I_25, 0.5·I_10, 0.1·I_25).
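The grouping scheme and the covariance matrices in the quoted setup can be sketched as follows. This is a hypothetical illustration of the bookkeeping only (variable names and the sampling code are mine, not the authors'): k = 1000 samples are partitioned into 1000/n context groups of length n, each fine-tuned for 50 iterations, and β_TT is drawn from N(0, Σ_β,TT) with the stated diagonal covariance.

```python
import numpy as np

# Hypothetical sketch of the TTT data grouping described above.
k, n, d, iters_per_group = 1000, 40, 60, 50

groups = np.split(np.arange(k), k // n)      # 1000/40 = 25 groups of length n
total_iters = len(groups) * iters_per_group  # 25 groups x 50 iterations each

# Diagonal covariances from the quoted linear setting (25 + 10 + 25 = d = 60)
Sigma_beta = np.diag(np.concatenate(
    [0.1 * np.ones(25), 0.5 * np.ones(10), np.ones(25)]))
Sigma_beta_tt = np.diag(np.concatenate(
    [np.ones(25), 0.5 * np.ones(10), 0.1 * np.ones(25)]))

# Sample beta_TT ~ N(0, Sigma_beta_tt); the covariance is diagonal,
# so elementwise scaling of a standard normal draw suffices.
rng = np.random.default_rng(0)
beta_tt = np.sqrt(np.diag(Sigma_beta_tt)) * rng.normal(size=d)
```

Note that the test-time covariance Σ_β,TT reverses the weighting of Σ_β across the three coordinate blocks, which creates the train/test distribution shift the TTT experiments probe.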