Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Label Distribution Shift-Aware Prediction Refinement for Test-Time Adaptation
Authors: Minguk Jang, Hye Won Chung
TMLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on CIFAR, PACS, Office-Home, and ImageNet benchmarks demonstrate DART's ability to correct inaccurate predictions caused by test-time distribution shifts. |
| Researcher Affiliation | Academia | Minguk Jang EMAIL School of Electrical Engineering, KAIST; Hye Won Chung EMAIL School of Electrical Engineering, KAIST |
| Pseudocode | Yes | The pseudocode for DART is presented in Appendix B. |
| Open Source Code | No | The paper mentions publicly released trained models and codes for baselines (e.g., "https://github.com/locuslab/tta_conjugate"), but does not explicitly state that the source code for their own method, DART, is publicly available or provide a link for it. |
| Open Datasets | Yes | We evaluate the effectiveness of DART across a wide range of TTA benchmarks, including CIFAR-10/100C, ImageNet-C, CIFAR-10.1, PACS, and Office-Home. For synthetic distribution shifts, we apply 15 different types of common corruption... CIFAR-10C (Hendrycks & Dietterich, 2019) serves as a benchmark... CIFAR-10.1 (Recht et al., 2018) is a newly collected test dataset for CIFAR-10... PACS benchmark consists of samples from seven classes... The Office-Home (Venkateswara et al., 2017) benchmark is one of the well-known large-scale domain adaptation benchmarks... |
| Dataset Splits | Yes | To analyze the effects of label distribution shifts, we define the number of samples for class k as nk = n·(1/ρ)^(k/(K-1))... We also consider CIFAR-100C-imb and ImageNet-C-imb, whose label distributions keep changing during test time, as described in Section 2. These datasets are composed of K subsets, where K is the number of classes. We assume a class distribution of the k-th subset as [p1, p2, . . . , pK], where pk = pmax and pi = pmin = (1 - pmax)/(K - 1) for i ≠ k. The imbalance ratio (IR) is defined as IR = pmax/pmin. Each subset consists of 100 samples from the CIFAR-100C and ImageNet-C test set based on the above class distribution. |
| Hardware Specification | Yes | For instance, training gφ during the intermediate time for CIFAR-10 takes only 7 minutes and 9 seconds on RTX A100. |
| Software Dependencies | No | The paper mentions "Adam optimizer (Kingma & Ba, 2014)" and "PyTorch library (Paszke et al., 2019)" but does not provide specific version numbers for PyTorch or other software dependencies required to replicate the experiments. |
| Experiment Setup | Yes | We use a 2-layer MLP (Haykin, 1998) for the prediction refinement module gφ... The hidden dimension of the prediction refinement module is set to 1,000. During the intermediate time, we train gφ with Adam optimizer (Kingma & Ba, 2014), a learning rate of 0.001, and cosine annealing for 50 epochs. For CIFAR-10/100, we train the model with 200 epochs, batch size 200, SGD optimizer, learning rate 0.1, momentum 0.9, and weight decay 0.0005. |
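The Dataset Splits row quotes two constructions: a long-tailed per-class sample count nk = n·(1/ρ)^(k/(K-1)) and a per-subset class distribution with one dominant class pmax. A minimal sketch of both, assuming zero-based class indexing k = 0..K-1 and integer truncation of the counts (neither is specified in the quote):

```python
def long_tailed_counts(n, K, rho):
    """Per-class sample counts n_k = n * (1/rho)^(k / (K - 1)).

    Class 0 keeps n samples; class K-1 keeps n / rho (rho = imbalance ratio).
    """
    return [int(n * (1.0 / rho) ** (k / (K - 1))) for k in range(K)]

def subset_class_distribution(k, K, p_max):
    """Class distribution of the k-th subset: class k has probability p_max,
    every other class has p_min = (1 - p_max) / (K - 1)."""
    p_min = (1.0 - p_max) / (K - 1)
    return [p_max if i == k else p_min for i in range(K)]

counts = long_tailed_counts(n=500, K=10, rho=100.0)
dist = subset_class_distribution(k=2, K=10, p_max=0.5)
ir = max(dist) / min(dist)  # imbalance ratio IR = p_max / p_min
```

With p_max = 0.5 and K = 10 this gives IR = 9, since p_min = 0.5/9.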
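The Experiment Setup row quotes cosine annealing over 50 epochs with a base learning rate of 0.001. As a minimal sketch of that schedule (the floor learning rate of 0 and per-epoch stepping are assumptions, not stated in the quote):

```python
import math

def cosine_annealed_lr(epoch, total_epochs=50, base_lr=1e-3, min_lr=0.0):
    # Cosine annealing: decay base_lr to min_lr over total_epochs.
    return min_lr + 0.5 * (base_lr - min_lr) * (
        1 + math.cos(math.pi * epoch / total_epochs)
    )

lr_start = cosine_annealed_lr(0)    # base learning rate at epoch 0
lr_end = cosine_annealed_lr(50)     # decays to min_lr by the final epoch
```

In PyTorch this corresponds to `torch.optim.lr_scheduler.CosineAnnealingLR` with `T_max=50` wrapped around the Adam optimizer.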