The Surprising Effectiveness of Test-Time Training for Few-Shot Learning

Authors: Ekin Akyürek, Mehul Damani, Adam Zweiger, Linlu Qiu, Han Guo, Jyothish Pari, Yoon Kim, Jacob Andreas

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental On the Abstraction and Reasoning Corpus (ARC), performing TTT with in-context examples yields up to 6× higher accuracy than fine-tuned baselines, reaching 53.0% on the public validation set with an 8B-parameter LM and 61.9% when ensembled with program-synthesis methods, matching average human performance. On BIG-Bench Hard (BBH), TTT on in-context examples surpasses standard few-shot prompting in the 10-shot setting by 7.3 percentage points (50.5% to 57.8%).
Researcher Affiliation Academia 1 Massachusetts Institute of Technology. Correspondence to: Ekin Akyurek <EMAIL>.
Pseudocode No The paper describes methods and processes through textual descriptions and diagrams (Figures 1, 2, 3, 4, 10, 11, 12, 14, 15, 16) but does not contain any structured pseudocode or algorithm blocks.
Open Source Code Yes Code and data are available at https://github.com/ekinakyurek/marc (ARC) and https://github.com/adamzweiger/Fewshot-TTT (BBH).
Open Datasets Yes An application of TTT to two challenging benchmark suites: the Abstraction and Reasoning Corpus (ARC; Chollet, 2019) and BIG-Bench Hard (BBH; Srivastava et al., 2023; Suzgun et al., 2023).
Dataset Splits Yes We randomly pick 80 balanced ARC tasks from the ARC validation set, including 20 easy, 20 medium, 20 hard, and 20 expert tasks according to the classification in (LeGris et al., 2024)... For the 27 tasks in BBH, we consider the 10-shot setting, where we select 10 random pairs from each task's dataset to be demonstration pairs and evaluate on the remaining data.
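The split procedure quoted above can be sketched in a few lines. This is a hypothetical illustration, not the authors' code: the task lists, difficulty labels, and seed are placeholders.

```python
import random

def pick_balanced_arc_tasks(tasks_by_difficulty, per_level=20, seed=0):
    """Sample `per_level` tasks from each difficulty bucket
    (easy/medium/hard/expert) -- 80 tasks total in the paper's setup."""
    rng = random.Random(seed)
    picked = []
    for level in ("easy", "medium", "hard", "expert"):
        picked.extend(rng.sample(tasks_by_difficulty[level], per_level))
    return picked

def split_bbh_task(pairs, n_shots=10, seed=0):
    """Hold out `n_shots` random demonstration pairs from one BBH task;
    the remaining pairs form the evaluation set."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    return shuffled[:n_shots], shuffled[n_shots:]
```

In this sketch each of the 27 BBH tasks would call `split_bbh_task` independently, so demonstrations never leak into the evaluation set for that task.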
Hardware Specification Yes We use 2x NVIDIA A100 GPUs for 1B models and 4x NVIDIA A100 GPUs for 3B and 8B models. We present hyperparameters in Table 4. ... With that, the whole TTT and inference process takes approximately 12 hours for 100 randomly sampled validation tasks when using an NVIDIA A100 GPU.
Software Dependencies Yes We perform full fine-tuning on Llama-3 family models by using the torchtune library. ... We similarly use the torchtune (torchtune Maintainers & Contributors, 2024) library for test-time training and the vLLM (Kwon et al., 2023) library for inference.
Experiment Setup Yes We present hyperparameters in Table 4. ... Table 4. ARC Initial Fine-tuning Hyperparameters: learning rate 2.5e-5; epochs 2; batch size 32; optimizer AdamW (Loshchilov & Hutter, 2018); scheduler cosine LR schedule with 2k warmup. ... Table 6. ARC TTT Hyperparameters. ... Table 8. BBH TTT Fine-tuning Hyperparameters.
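The Table 4 values quoted above (AdamW at lr 2.5e-5, cosine schedule with 2k warmup steps) imply a standard warmup-then-cosine-decay learning-rate curve. Below is a minimal sketch of that curve, assuming linear warmup and decay to zero; `total_steps` is a placeholder, not a number from the paper, and the authors actually use torchtune rather than anything hand-rolled.

```python
import math

# Hyperparameters as reported in Table 4 (ARC initial fine-tuning).
FINETUNE_HPARAMS = {
    "learning_rate": 2.5e-5,
    "epochs": 2,
    "batch_size": 32,
    "optimizer": "AdamW",   # Loshchilov & Hutter, 2018
    "warmup_steps": 2000,   # "2k warmup"
}

def lr_at_step(step, base_lr=2.5e-5, warmup_steps=2000, total_steps=10000):
    """Cosine LR schedule with linear warmup (a common convention;
    total_steps here is an assumed placeholder)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Under this convention the learning rate climbs linearly to 2.5e-5 over the first 2000 steps, then decays along a half-cosine to zero at `total_steps`.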