The Surprising Effectiveness of Test-Time Training for Few-Shot Learning

Authors: Ekin Akyürek, Mehul Damani, Adam Zweiger, Linlu Qiu, Han Guo, Jyothish Pari, Yoon Kim, Jacob Andreas

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental On the Abstraction and Reasoning Corpus (ARC), performing TTT with in-context examples yields up to 6× higher accuracy than fine-tuned baselines, reaching 53.0% on the public validation set with an 8B-parameter LM and 61.9% when ensembled with program-synthesis methods, matching average human performance. On BIG-Bench Hard (BBH), TTT on in-context examples surpasses standard few-shot prompting in the 10-shot setting by 7.3 percentage points (50.5% to 57.8%).
Researcher Affiliation Academia 1 Massachusetts Institute of Technology. Correspondence to: Ekin Akyurek <EMAIL>.
Pseudocode No The paper describes methods and processes through textual descriptions and diagrams (Figures 1, 2, 3, 4, 10, 11, 12, 14, 15, 16) but does not contain any structured pseudocode or algorithm blocks.
Open Source Code Yes Code and data are available at https://github.com/ekinakyurek/marc (ARC) and https://github.com/adamzweiger/Fewshot-TTT (BBH).
Open Datasets Yes An application of TTT to two challenging benchmark suites: the Abstraction and Reasoning Corpus (ARC; Chollet, 2019) and BIG-Bench Hard (BBH; Srivastava et al., 2023; Suzgun et al., 2023).
Dataset Splits Yes We randomly pick 80 balanced ARC tasks from the ARC validation set, including 20 easy, 20 medium, 20 hard, and 20 expert tasks according to the classification in (LeGris et al., 2024)... For the 27 tasks in BBH, we consider the 10-shot setting, where we select 10 random pairs from each task's dataset to be demonstration pairs and evaluate on the remaining data.
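The split procedure quoted above can be sketched in a few lines. This is a hypothetical illustration, not the authors' code: the task lists, difficulty labels, and seed are placeholders.

```python
import random

def pick_balanced_arc_tasks(tasks_by_difficulty, per_level=20, seed=0):
    """Sample `per_level` tasks from each difficulty bucket
    (easy/medium/hard/expert) -- 80 tasks total in the paper's setup."""
    rng = random.Random(seed)
    picked = []
    for level in ("easy", "medium", "hard", "expert"):
        picked.extend(rng.sample(tasks_by_difficulty[level], per_level))
    return picked

def split_bbh_task(pairs, n_shots=10, seed=0):
    """Hold out `n_shots` random demonstration pairs from one BBH task;
    the remaining pairs form the evaluation set."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    return shuffled[:n_shots], shuffled[n_shots:]
```

In this sketch each of the 27 BBH tasks would call `split_bbh_task` independently, so demonstrations never leak into the evaluation set for that task.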
Hardware Specification Yes We use 2x NVIDIA A100 GPUs for 1B models and 4x NVIDIA A100 GPUs for 3B and 8B models. We present hyperparameters in Table 4. ... With that, the whole TTT and inference process takes approximately 12 hours for 100 randomly sampled validation tasks when using an NVIDIA A100 GPU.
Software Dependencies Yes We perform full fine-tuning on Llama-3 family models by using the torchtune library. ... We similarly use the torchtune (torchtune Maintainers & Contributors, 2024) library for test-time training and the vLLM (Kwon et al., 2023) library for inference.
Experiment Setup Yes We present hyperparameters in Table 4. ... Table 4. ARC Initial Fine-tuning Hyperparameters: learning rate 2.5e-5; epochs 2; batch size 32; optimizer AdamW (Loshchilov & Hutter, 2018); scheduler cosine LR schedule with 2k warmup. ... Table 6. ARC TTT Hyperparameters. ... Table 8. BBH TTT Fine-tuning Hyperparameters.
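The Table 4 values quoted above (AdamW at lr 2.5e-5, cosine schedule with 2k warmup steps) imply a standard warmup-then-cosine-decay learning-rate curve. Below is a minimal sketch of that curve, assuming linear warmup and decay to zero; `total_steps` is a placeholder, not a number from the paper, and the authors actually use torchtune rather than anything hand-rolled.

```python
import math

# Hyperparameters as reported in Table 4 (ARC initial fine-tuning).
FINETUNE_HPARAMS = {
    "learning_rate": 2.5e-5,
    "epochs": 2,
    "batch_size": 32,
    "optimizer": "AdamW",   # Loshchilov & Hutter, 2018
    "warmup_steps": 2000,   # "2k warmup"
}

def lr_at_step(step, base_lr=2.5e-5, warmup_steps=2000, total_steps=10000):
    """Cosine LR schedule with linear warmup (a common convention;
    total_steps here is an assumed placeholder)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Under this convention the learning rate climbs linearly to 2.5e-5 over the first 2000 steps, then decays along a half-cosine to zero at `total_steps`.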