Test-time Correlation Alignment
Authors: Linjing You, Jiabao Lu, Xiayuan Huang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments validate our theoretical insights and show that TCA methods significantly outperform baselines across various tasks, benchmarks, and backbones. Notably, Linear TCA achieves higher accuracy with only 4% of the GPU memory and 0.6% of the computation time of the best TTA baseline. |
| Researcher Affiliation | Academia | ¹Institute of Automation, Chinese Academy of Sciences; ²College of Science, Beijing Forestry University. Correspondence to: Xiayuan Huang <EMAIL>, Linjing You <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Linear TCA Algorithm |
| Open Source Code | Yes | Code: https://github.com/youlj109/TCA. |
| Open Datasets | Yes | Following previous studies, we evaluate the adaptation performance on two main tasks: domain generalization (PACS (Li et al., 2017), Office-Home (Venkateswara et al., 2017), and DomainNet (Peng et al., 2019) datasets) and image corruption (CIFAR-10-C, CIFAR-100-C, and ImageNet-C (Hendrycks & Dietterich, 2019)). |
| Dataset Splits | No | In the test-time adaptation (TTA) scenario (Tan et al., 2024; Yuan et al., 2023), the model has access only to unlabeled data from the test domain and a pre-trained model from the source domain. Specifically, let D_s = {(x_s^i, y_s^i)}_{i=1}^{n_s} denote the labeled source-domain dataset, where each (x_s^i, y_s^i) is sampled i.i.d. from the distribution 𝒟_s and n_s is the number of source instances. The model, trained on the source-domain dataset and parameterized by θ, is denoted h_θ(·) = g(f(·)) : X_s → Y_s, where f(·) is the backbone encoder and g(·) is the decoder head. During testing, h_θ(·) performs well on in-distribution (ID) test instances drawn from 𝒟_s. However, given a set of out-of-distribution (OOD) test instances D_t = {x_t^i}_{i=1}^{n_t} ∼ 𝒟_t with 𝒟_t ≠ 𝒟_s, the prediction performance of h_θ(·) decreases significantly. To this end, the goal of TTA is to adapt h_θ(·) to D_t without access to D_s. For the model pre-trained for the ImageNet-C experiments, we utilize the model provided by TorchVision. The paper describes using known datasets but does not explicitly state the train/validation/test splitting strategy used in its experiments. |
| Hardware Specification | No | Notably, Linear TCA achieves higher accuracy with only 4% of the GPU memory and 0.6% of the computation time of the best TTA baseline. In Figure 1b, we illustrate the computation time and maximum GPU memory usage of different TTA methods on the CIFAR-10-C dataset. The paper mentions 'GPU memory' and 'GPU memory usage' but does not specify the exact GPU model or any other hardware components such as CPU or TPU models. |
| Software Dependencies | No | The features are generated using PyTorch and serve as synthetic examples. The paper mentions 'PyTorch' but does not provide a version number or list other software dependencies with their versions. |
| Experiment Setup | Yes | For the Linear TCA method, we optimized the number of pseudo-source instances k within the range {5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 100, 200, 300}. For most datasets and backbones, smaller k values generally yield satisfactory results; for datasets with a substantial number of images per class, it is advisable to experiment with larger k values. For the Linear TCA+ method, we optimized k on top of the other top-performing test-time adaptation method and its parameter settings. During the test-time adaptation phase, both the domain generalization and image corruption tasks use backbone-specific batch sizes: ResNet-18 and ResNet-50 use a batch size of 128, whereas ViT-B/16 is configured with a batch size of 64. For the image corruption task, we experiment with each TTA method using learning rates from {1e-7, 5e-7, 1e-6, 5e-6, 1e-5, 5e-5, 1e-4, 5e-4, 1e-3, 5e-3, 1e-2, 5e-2, 1e-1} and the entropy-filter hyperparameter from {1, 5, 10, 15, 20, 50, 100, 200, 300}. |
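The table above notes that the paper provides pseudocode (Algorithm 1, Linear TCA) and tunes the number of pseudo-source instances k per class. As a rough illustration of the underlying idea — CORAL-style correlation alignment applied at test time without touching model weights — the NumPy sketch below builds a pseudo-source set from the most confident test predictions and then linearly re-colors test features so their second-order statistics match it. All function names here are hypothetical, the mean re-centering step is an assumption, and this is a sketch of the general technique rather than the authors' exact Algorithm 1.

```python
import numpy as np

def select_pseudo_source(feats, probs, k):
    """Pick the k most confident test instances per predicted class
    to serve as a pseudo-source set (hypothetical helper)."""
    preds = probs.argmax(axis=1)
    conf = probs.max(axis=1)
    chosen = []
    for c in np.unique(preds):
        idx = np.where(preds == c)[0]
        # indices of the k highest-confidence instances of class c
        top = idx[np.argsort(conf[idx])[::-1][:k]]
        chosen.append(top)
    return feats[np.concatenate(chosen)]

def coral_align(test_feats, pseudo_src_feats, eps=1e-5):
    """CORAL-style linear correlation alignment (sketch):
    whiten test features with their own covariance, then re-color
    them with the pseudo-source covariance, so the second-order
    statistics of the two feature sets match."""
    def sqrt_and_inv_sqrt(cov):
        # symmetric matrix square root via eigendecomposition
        vals, vecs = np.linalg.eigh(cov)
        vals = np.clip(vals, eps, None)
        sqrt = vecs @ np.diag(np.sqrt(vals)) @ vecs.T
        inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
        return sqrt, inv_sqrt

    d = test_feats.shape[1]
    mu_t = test_feats.mean(axis=0)
    mu_s = pseudo_src_feats.mean(axis=0)
    # regularized sample covariances of both feature sets
    cov_t = np.cov(test_feats, rowvar=False) + eps * np.eye(d)
    cov_s = np.cov(pseudo_src_feats, rowvar=False) + eps * np.eye(d)

    sqrt_s, _ = sqrt_and_inv_sqrt(cov_s)
    _, inv_sqrt_t = sqrt_and_inv_sqrt(cov_t)

    # whiten with test statistics, re-color with pseudo-source statistics
    return (test_feats - mu_t) @ inv_sqrt_t @ sqrt_s + mu_s
```

Because the adaptation is a single closed-form linear map on frozen features, it needs no backpropagation, which is consistent with the table's observation that Linear TCA uses a small fraction of the GPU memory and computation time of gradient-based TTA baselines.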