The Over-Certainty Phenomenon in Modern Test-Time Adaptation Algorithms

Authors: Fin Amin, Jung-Eun Kim

TMLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | "In order to evaluate DEC, we conduct a series of experiments using three different backbone models across four datasets. Our primary evaluation metrics will be model accuracy, NLL, and ECE (bins = 15) on the observations, allowing us to examine both the predictive performance and the calibration quality of the models."
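The calibration metric quoted above, ECE with 15 equal-width confidence bins, follows a standard formulation. The sketch below is illustrative (function and variable names are ours, not the authors'):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """Standard binned ECE: group predictions into equal-width
    confidence bins, then average |accuracy - confidence| per bin,
    weighted by the fraction of samples in each bin."""
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    accuracies = (predictions == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(accuracies[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece
```

A perfectly calibrated, fully confident classifier scores an ECE of 0; a single wrong prediction made with confidence 0.9 scores 0.9.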
Researcher Affiliation | Academia | "Fin Amin (EMAIL), Department of Electrical and Computer Engineering, North Carolina State University; Jung-Eun Kim (EMAIL), Department of Computer Science, North Carolina State University"
Pseudocode | Yes | "Algorithm 1: Compute Certainty Regularizer (CCR) ... Algorithm 2: Dynamic Entropy Control (DEC)"
Open Source Code | Yes | "In the interest of reproducibility, we release our code at https://github.com/FinAminToastCrunch/DynamicEntropyControl."
Open Datasets | Yes | "The following publicly available TTA datasets are used in our experiments; we selected these because they are commonly used in existing works and provide a variety of domain shifts. 1. PACS (Li et al., 2017) ... 2. Office-Home (Venkateswara et al., 2017) ... 3. Digits is a combination of three digit datasets: USPS (Hull, 1994), MNIST (LeCun et al., 2010), and SVHN (Netzer et al., 2011). ... 4. Tiny ImageNet-C (TIN-C) (Le & Yang, 2015)"
Dataset Splits | Yes | "1. PACS (Li et al., 2017) has 4 domains: photo, art, cartoon, sketch, with 7 classes. Tested using LOO. ... 2. Office-Home (Venkateswara et al., 2017) ... Tested using LOO. ... 3. Digits is a combination of three digit datasets: USPS (Hull, 1994), MNIST (LeCun et al., 2010), and SVHN (Netzer et al., 2011). ... Tested using LOO by training on the source domains' training sets and adapting to the target domain's test set. ... 4. Tiny ImageNet-C (TIN-C) (Le & Yang, 2015) has 15 domains with 200 classes. ... Backbones are trained on the corruption-free (source) training set, then adapted to and evaluated on the corrupted (target) domains. For each target domain, there are 5 tiers of corruption."
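The leave-one-out (LOO) protocol described above holds out each domain once as the adaptation target while training on the rest. A minimal sketch (the domain names are illustrative, not taken from the authors' code):

```python
def leave_one_out_splits(domains):
    """Yield (source_domains, target_domain) pairs: each domain
    is held out exactly once as the adaptation target."""
    for target in domains:
        sources = [d for d in domains if d != target]
        yield sources, target

# PACS-style example: 4 domains -> 4 LOO splits.
pacs = ["photo", "art", "cartoon", "sketch"]
splits = list(leave_one_out_splits(pacs))
```

For PACS this yields four runs, e.g. sources = ["art", "cartoon", "sketch"] with target = "photo" for the first split.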
Hardware Specification | Yes | "We used TensorFlow 2.9 (Abadi et al., 2015) with NVIDIA cuDNN version 11.3 on an RTX 3080 16GB laptop GPU with 32GB of system memory."
Software Dependencies | Yes | "We used TensorFlow 2.9 (Abadi et al., 2015) with NVIDIA cuDNN version 11.3 on an RTX 3080 16GB laptop GPU with 32GB of system memory. ... We do most initial training on the source domain using RMSProp (lr = 2e-4) (Tieleman et al., 2012) ... The small CNN is compiled and initially trained with the Adam optimizer (Kingma & Ba, 2014)."
Experiment Setup | Yes | "We do most initial training on the source domain using RMSProp (lr = 2e-4) (Tieleman et al., 2012) to minimize cross-entropy loss for epochs = {15, 15, 5, 25} for each enumerated dataset, respectively. ... For ETA, we set E_0 = 0.4 ln(C), as this was their recommended value, and ϵ = {0.6, 0.1, 0.4, 0.125} for each enumerated dataset, respectively. ... For T3A, we set the number of supports to retain, M = , as this provides the lowest calibration error. For SoTTA, we set ρ = 0.05 and C_0 = {0.33, 0.33, 0.33, 0.66} for each dataset, respectively, to help their performance, and N_SoTTA = 64 as per their recommendations. We use a batch size of 50 for our DEC in all experiments. ... All images are resized to (227, 227, 3) and scaled between [0, 255]. ... MobileNet: we set the t_min and t_max parameters to 1.20 and 2.75, respectively. ... All experiments are run three times using random_seed = 0, 1, 2, respectively."
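The ETA threshold E_0 = 0.4 ln(C) quoted above scales with the number of classes C, so the same factor transfers across datasets. A sketch of how such an entropy margin filters out unreliable (high-entropy) samples, assuming ETA's standard reliability-filtering behavior (function names are ours):

```python
import numpy as np

def entropy_filter(probs, num_classes, factor=0.4):
    """Keep only samples whose predictive entropy falls below
    E_0 = factor * ln(C), the margin ETA uses to discard
    unreliable test samples before adaptation."""
    e0 = factor * np.log(num_classes)
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return entropy < e0

# A confident prediction passes the filter; a near-uniform
# prediction (entropy = ln C) is rejected.
probs = np.array([[0.97, 0.01, 0.01, 0.01],
                  [0.25, 0.25, 0.25, 0.25]])
mask = entropy_filter(probs, num_classes=4)
```

With C = 4, E_0 = 0.4 ln 4 ≈ 0.55 nats, so the uniform row (entropy ln 4 ≈ 1.39) is filtered while the confident row is kept.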