Test-Time Learning for Large Language Models
Authors: Jinwu Hu, Zitian Zhang, Guohao Chen, Xutao Wen, Chao Shuai, Wei Luo, Bin Xiao, Yuanqing Li, Mingkui Tan
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We establish the AdaptEval benchmark for TTL and demonstrate through experiments that our TLM improves performance by at least 20% over original LLMs on domain knowledge adaptation. ... We compare our proposed TLM, the original LLM, Tent, EATA, and COME to demonstrate the superior performance of our method. We conduct experiments on different types of datasets, including Domain Bench, Instruction Bench, and Reasoning Bench, as summarized in Tables 2 and 3. ... 5.3. Ablation Studies |
| Researcher Affiliation | Academia | 1School of Software Engineering, South China University of Technology, China; 2Pazhou Laboratory, China; 3Zhejiang University, China; 4South China Agricultural University, China; 5Chongqing University of Posts and Telecommunications, China; 6Key Laboratory of Big Data and Intelligent Robot, Ministry of Education, China. Correspondence to: Mingkui Tan <EMAIL>, Yuanqing Li <EMAIL>, Bin Xiao <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: The pipeline of the proposed TLM. Input: test samples D_Test = {x_j}_{j=1}^{M}, the trained LLM f_Θ(·), LoRA ΔΘ with trainable parameters B and A, batch size B. 1: Initialize LoRA parameters ΔΘ. 2: Add LoRA parameters to the trained LLM: Θ̃ = Θ + ΔΘ. 3: for each batch X = {x_b}_{b=1}^{B} in D_Test do 4: Calculate predictions ŷ for all x ∈ X via f_Θ̃(·). 5: Calculate sample selection score S(x) via Eqn. (6). 6: Update the LLM (Θ̃) with Eqn. (5). 7: end for. Output: the final LLM (Θ̃). |
| Open Source Code | Yes | The source code is available at https://github.com/Fhujinwu/TLM |
| Open Datasets | Yes | To build a diverse and challenging evaluation framework, we collect high-quality datasets from Hugging Face, ensuring coverage across various data distributions. ... The GeoSignal dataset is a knowledge-intensive instruction-tuning resource tailored for the Earth Sciences domain... The Agriculture-QA dataset focuses on agricultural QA... GenMedGPT-5k with a total of 5.45k samples is a medical dialogue dataset... The Wealth-Alpaca-LoRA dataset is focused on the financial domain... The Dolly-15k dataset, created by Databricks... The Alpaca-GPT4 dataset comprises 52k instruction-following samples generated using GPT-4... InstructionWild is a large dataset focused on real-world user instructions... GSM8K is a high-quality dataset of linguistically diverse elementary school math word problems, constructed by OpenAI... MetaMath is a large-scale dataset comprising approximately 395k samples... LogiQA is a high-quality, comprehensive dataset focused on logical reasoning... |
| Dataset Splits | Yes | From this dataset, we randomly select 5k samples to form the Geography dataset, which evaluates the model's domain knowledge and task performance in Geography. ... We randomly select 5k samples to create the Agriculture dataset... We randomly select 5k samples to create the Medicine dataset... We randomly select 5k samples to create the Finance dataset... A subset of 5k samples is randomly selected to evaluate model performance. ... From this dataset, we randomly select 5k samples to test the model's generalization capabilities... We randomly extract 5k samples to evaluate the model's ability to understand and execute instructions effectively. ... For evaluation, we combine the training and test sets and randomly select 5k samples. ... For evaluation purposes, we randomly select 5k samples from the training set. ... we create multiple-choice prompts and randomly select 5k samples for evaluation. |
| Hardware Specification | Yes | The training and evaluation are conducted on the 80G memory-sized NVIDIA A800 GPUs with CUDA version 12.1. |
| Software Dependencies | Yes | The training and evaluation are conducted on the 80G memory-sized NVIDIA A800 GPUs with CUDA version 12.1. Our method is implemented using the PyTorch framework, version 2.5.1. The training framework used is LLaMA-Factory. |
| Experiment Setup | Yes | We use AdamW as the update rule, with a batch size of 1 and learning rates of 5e-5 / 5e-5 / 1e-6 for Domain Bench / Instruction Bench / Reasoning Bench, respectively. The λ and P0 in Eqn. 6 are set to 0.10 and e3. To improve the stability of outputs produced by LLMs, we apply greedy decoding with a temperature of 0 across all experiments. |
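The Pseudocode row above describes a loop that, at test time, scores each incoming sample, keeps only the reliable ones, and updates a small LoRA adapter while the backbone stays frozen. The following is a minimal, self-contained sketch of that loop in plain Python, under heavy simplifying assumptions: a toy two-class scorer stands in for the LLM, a flat additive `delta` vector stands in for the LoRA factors B and A, a plain entropy threshold replaces the selection score S(x) of Eqn. (6), and a perceptron-style confidence step replaces the gradient update of Eqn. (5). All names and constants here are illustrative, not the authors' implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(probs):
    """Shannon entropy (nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

class ToyTLM:
    """Toy stand-in for the TLM pipeline: frozen weights + trainable delta."""

    def __init__(self, dim, lr=5e-2, entropy_threshold=0.6):
        self.w_frozen = [0.5] * dim   # pretrained backbone weights (never updated)
        self.delta = [0.0] * dim      # stand-in for the LoRA update ΔΘ = B·A
        self.lr = lr
        self.entropy_threshold = entropy_threshold

    def logits(self, x):
        # Effective weights are backbone + adapter, mirroring Θ̃ = Θ + ΔΘ.
        w = [wf + d for wf, d in zip(self.w_frozen, self.delta)]
        score = sum(wi * xi for wi, xi in zip(w, x))
        return [score, -score]        # two-class toy "vocabulary"

    def adapt_batch(self, batch):
        """One pass of the test-time loop; returns how many samples were selected."""
        selected = 0
        for x in batch:
            probs = softmax(self.logits(x))
            # Sample selection: skip high-entropy (unreliable) predictions.
            if entropy(probs) > self.entropy_threshold:
                continue
            selected += 1
            # Confidence-reinforcing step on the adapter only;
            # the frozen backbone weights are never touched, as with LoRA.
            direction = 1.0 if probs[0] >= probs[1] else -1.0
            self.delta = [d + self.lr * direction * xi
                          for d, xi in zip(self.delta, x)]
        return selected
```

A confident sample (large logit gap) is selected and nudges `delta`; an ambiguous sample (near-uniform softmax, entropy close to ln 2 ≈ 0.693) is skipped, so the adapter is never updated on predictions the model itself distrusts.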