Test-Time Training Provably Improves Transformers as In-context Learners
Authors: Halil Alperen Gozeten, Muhammed Emrullah Ildiz, Xuechen Zhang, Mahdi Soltanolkotabi, Marco Mondelli, Samet Oymak
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | As our empirical contribution, we study the benefits of TTT for TabPFN, a tabular foundation model. In line with our theory, we demonstrate that TTT significantly reduces the required sample size for tabular classification (3 to 5 times fewer), unlocking substantial inference efficiency with a negligible training cost. |
| Researcher Affiliation | Academia | 1: University of Michigan, Ann Arbor; 2: University of Southern California; 3: Institute of Science and Technology Austria. |
| Pseudocode | No | The following proposition (proved in Appendix A) characterizes the single-step GD TTT update. Proposition 3.1. Consider the linear attention model with parameters $W \in \mathbb{R}^{d \times d}$. Suppose the test-time training loss function is defined as in (2) and define $u_{\text{context}} := X_{\text{context}}^\top y_{\text{context}} \in \mathbb{R}^d$. Then, for any step size $\eta > 0$, the new parameter $W_{\text{TT}}$ after one gradient-descent step from $W$ is given by the rank-1 update $W_{\text{TT}} = W + 2\eta\, X_{\text{train}}^\top \left(y_{\text{train}} - X_{\text{train}} W u_{\text{context}}\right) u_{\text{context}}^\top$. |
| Open Source Code | No | No explicit statement about providing source code for their methodology is found, nor is a link to a code repository provided. |
| Open Datasets | Yes | Specifically, we evaluate the TabPFN v2 model on The Tremendous TabLib Trawl (T4) dataset (Gardner et al., 2024) for a more comprehensive evaluation, which is a large-scale high-quality collection of tabular benchmarks. |
| Dataset Splits | Yes | Following the official TabPFN v2 implementation and our theoretical setup, we select the datasets containing at least 1,250 samples (with 1,000 for training, using an 80/20 split). |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory, or cloud instance types) are provided for running the experiments. The paper only mentions using the GPT-2 architecture and the TabPFN v2 model. |
| Software Dependencies | No | The paper mentions "TabPFN v2 (Hollmann et al., 2025)" and "GPT-2 architecture (Radford et al., 2019)" but does not provide specific version numbers for these or other software dependencies, such as programming languages or libraries. |
| Experiment Setup | Yes | For the TabPFN (blue curve in Figure 3a), we directly load the pre-trained model and vary the context window length during evaluation. In contrast, for TabPFN+TTT (orange curve in Figure 3a), we finetune the model using different context lengths with k = 1000 samples. As the context length n decreases, the samples are divided into 1000/n groups, where each group undergoes 50 training iterations. ... Setting: $d = 60$; $n = 40$; $k$ varying between 64 and 512 in increments of 64; $\sigma^2 = 0.01$; $\Sigma_\beta = \mathrm{diag}(0.1\, I_{25},\, 0.5\, I_{10},\, I_{25})$; $\Sigma_x = I$. We sample $\beta_{\text{TT}}$ from the distribution $\mathcal{N}(0, \Sigma_{\beta_{\text{TT}}})$ where $\Sigma_{\beta_{\text{TT}}} = \mathrm{diag}(I_{25},\, 0.5\, I_{10},\, 0.1\, I_{25})$. |
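The rank-1 update quoted in the Pseudocode row can be sanity-checked numerically. The sketch below assumes the test-time training loss of Eq. (2) is the squared loss $L(W) = \lVert y_{\text{train}} - X_{\text{train}} W u_{\text{context}} \rVert^2$ (an assumption consistent with the stated gradient step; the paper's exact Eq. (2) is not quoted here), and verifies that one gradient-descent step, with the gradient taken by finite differences, coincides with the closed-form rank-1 update of Proposition 3.1. All variable names and dimensions are illustrative.

```python
import numpy as np

# Numerical check of Proposition 3.1 on random data.
# Assumed loss (Eq. (2)): L(W) = || y_train - X_train W u_context ||^2.

rng = np.random.default_rng(0)
d, n_train, n_ctx = 5, 8, 6

X_train = rng.normal(size=(n_train, d))
y_train = rng.normal(size=n_train)
X_ctx = rng.normal(size=(n_ctx, d))
y_ctx = rng.normal(size=n_ctx)

u = X_ctx.T @ y_ctx                  # u_context = X_context^T y_context
W = rng.normal(size=(d, d))
eta = 0.1                            # step size

def loss(W):
    r = y_train - X_train @ W @ u
    return r @ r

# Closed-form rank-1 update from the proposition:
#   W_TT = W + 2*eta * X_train^T (y_train - X_train W u) u^T
r = y_train - X_train @ W @ u
W_tt = W + 2 * eta * np.outer(X_train.T @ r, u)

# One plain gradient-descent step, with the gradient estimated by
# central finite differences over every entry of W.
eps = 1e-6
num_grad = np.zeros_like(W)
for i in range(d):
    for j in range(d):
        E = np.zeros((d, d))
        E[i, j] = eps
        num_grad[i, j] = (loss(W + E) - loss(W - E)) / (2 * eps)
W_gd = W - eta * num_grad

assert np.allclose(W_gd, W_tt, atol=1e-4)    # GD step matches closed form
assert np.linalg.matrix_rank(W_tt - W) == 1  # the update is rank one
```

The second assertion confirms the "rank-1" claim directly: the update is the outer product of $X_{\text{train}}^\top r$ and $u_{\text{context}}$, so it perturbs $W$ only along a single direction pair.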