Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Label Distribution Shift-Aware Prediction Refinement for Test-Time Adaptation
Authors: Minguk Jang, Hye Won Chung
TMLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on CIFAR, PACS, Office-Home, and ImageNet benchmarks demonstrate DART's ability to correct inaccurate predictions caused by test-time distribution shifts. |
| Researcher Affiliation | Academia | Minguk Jang EMAIL School of Electrical Engineering, KAIST; Hye Won Chung EMAIL School of Electrical Engineering, KAIST |
| Pseudocode | Yes | The pseudocode for DART is presented in Appendix B. |
| Open Source Code | No | The paper mentions publicly released trained models and codes for baselines (e.g., "https://github.com/locuslab/tta_conjugate"), but does not explicitly state that the source code for their own method, DART, is publicly available or provide a link for it. |
| Open Datasets | Yes | We evaluate the effectiveness of DART across a wide range of TTA benchmarks, including CIFAR-10/100C, ImageNet-C, CIFAR-10.1, PACS, and Office-Home. For synthetic distribution shifts, we apply 15 different types of common corruption... CIFAR-10C (Hendrycks & Dietterich, 2019) serves as a benchmark... CIFAR-10.1 (Recht et al., 2018) is a newly collected test dataset for CIFAR-10... PACS benchmark consists of samples from seven classes... The Office-Home (Venkateswara et al., 2017) benchmark is one of the well-known large-scale domain adaptation benchmarks... |
| Dataset Splits | Yes | To analyze the effects of label distribution shifts, we define the number of samples for class k as nk = n·(1/ρ)^(k/(K-1))... We also consider CIFAR-100C-imb and ImageNet-C-imb, whose label distributions keep changing during test time, as described in Section 2. These datasets are composed of K subsets, where K is the number of classes. We assume a class distribution of the k-th subset as [p1, p2, . . . , pK], where pk = pmax and pi = pmin = (1 - pmax)/(K - 1) for i ≠ k. The imbalance ratio (IR) is defined as IR = pmax/pmin. Each subset consists of 100 samples from the CIFAR-100C and ImageNet-C test set based on the above class distribution. |
| Hardware Specification | Yes | For instance, training gφ during the intermediate time for CIFAR-10 takes only 7 minutes and 9 seconds on RTX A100. |
| Software Dependencies | No | The paper mentions "Adam optimizer (Kingma & Ba, 2014)" and "PyTorch library (Paszke et al., 2019)" but does not provide specific version numbers for PyTorch or other software dependencies required to replicate the experiments. |
| Experiment Setup | Yes | We use a 2-layer MLP (Haykin, 1998) for the prediction refinement module gφ... The hidden dimension of the prediction refinement module is set to 1,000. During the intermediate time, we train gφ with Adam optimizer (Kingma & Ba, 2014), a learning rate of 0.001, and cosine annealing for 50 epochs. For CIFAR-10/100, we train the model with 200 epochs, batch size 200, SGD optimizer, learning rate 0.1, momentum 0.9, and weight decay 0.0005. |
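The Dataset Splits row quotes two constructions: a long-tailed per-class sample count nk = n·(1/ρ)^(k/(K-1)) and a per-subset class distribution with one dominant class pmax. A minimal sketch of both, assuming zero-based class indexing k = 0..K-1 and integer truncation of the counts (neither is specified in the quote):

```python
def long_tailed_counts(n, K, rho):
    """Per-class sample counts n_k = n * (1/rho)^(k / (K - 1)).

    Class 0 keeps n samples; class K-1 keeps n / rho (rho = imbalance ratio).
    """
    return [int(n * (1.0 / rho) ** (k / (K - 1))) for k in range(K)]

def subset_class_distribution(k, K, p_max):
    """Class distribution of the k-th subset: class k has probability p_max,
    every other class has p_min = (1 - p_max) / (K - 1)."""
    p_min = (1.0 - p_max) / (K - 1)
    return [p_max if i == k else p_min for i in range(K)]

counts = long_tailed_counts(n=500, K=10, rho=100.0)
dist = subset_class_distribution(k=2, K=10, p_max=0.5)
ir = max(dist) / min(dist)  # imbalance ratio IR = p_max / p_min
```

With p_max = 0.5 and K = 10 this gives IR = 9, since p_min = 0.5/9.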
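The Experiment Setup row quotes cosine annealing over 50 epochs with a base learning rate of 0.001. As a minimal sketch of that schedule (the floor learning rate of 0 and per-epoch stepping are assumptions, not stated in the quote):

```python
import math

def cosine_annealed_lr(epoch, total_epochs=50, base_lr=1e-3, min_lr=0.0):
    # Cosine annealing: decay base_lr to min_lr over total_epochs.
    return min_lr + 0.5 * (base_lr - min_lr) * (
        1 + math.cos(math.pi * epoch / total_epochs)
    )

lr_start = cosine_annealed_lr(0)    # base learning rate at epoch 0
lr_end = cosine_annealed_lr(50)     # decays to min_lr by the final epoch
```

In PyTorch this corresponds to `torch.optim.lr_scheduler.CosineAnnealingLR` with `T_max=50` wrapped around the Adam optimizer.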