Beyond Entropy: Region Confidence Proxy for Wild Test-Time Adaptation
Authors: Zixuan Hu, Yichun Hu, Xiaotong Li, Shixiang Tang, Lingyu Duan
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments demonstrate the consistent superiority of ReCAP over existing methods across various datasets and wild scenarios. The source code will be available at https://github.com/hzcar/ReCAP. |
| Researcher Affiliation | Academia | 1 School of Computer Science, Peking University, Beijing, China; 2 Peng Cheng Laboratory, Shenzhen, China; 3 The Chinese University of Hong Kong, Hong Kong, China. Correspondence to: Ling-Yu Duan <EMAIL>. |
| Pseudocode | No | The paper describes mathematical formulations and derivations, such as Lemmas 4.1 and 4.2, Propositions 4.3 and 4.4, and an 'Overall Procedure of ReCAP' describing the loss function. However, it does not contain any clearly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | Yes | The source code will be available at https://github.com/hzcar/ReCAP. |
| Open Datasets | Yes | Datasets. We conduct our experiments on three datasets to evaluate the robustness and generalization capability of our method under diverse distribution shifts: 1) ImageNet-C (Hendrycks & Dietterich, 2019), a large-scale dataset categorized into 15 common corruption types and 5 severity levels for each type. 2) ImageNet-R (Hendrycks et al., 2021) and 3) VisDA-2021 (Bashkirova et al., 2022), two datasets which encompass diverse domain shifts due to varying styles and textures (e.g., sketch, cartoon) compared to ImageNet-C, to assess the efficacy for more challenging wild test scenarios in Appendix B. |
| Dataset Splits | Yes | In this paper, we primarily evaluate the out-of-distribution (OOD) generalization ability of all methods using a widely adopted benchmark: ImageNet-C (Hendrycks & Dietterich, 2019). ImageNet-C is derived by applying a series of corruptions to the original ImageNet (Deng et al., 2009) test set, making it a large-scale benchmark for assessing model robustness under real-world distribution shifts. |
| Hardware Specification | Yes | We assess TTA approaches for processing 50,000 images in Gaussian corruption type, using a single Nvidia RTX 4090 GPU. |
| Software Dependencies | No | The paper mentions software components like 'timm (Wightman, 2019)' and models like 'ResNet50-GN (Wu & He, 2018) and ViT-Base-LN (Dosovitskiy et al., 2020)', but it does not provide specific version numbers for these or other software libraries/environments (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | For the optimizer, we use SGD with a batch size of 64 (except for the batch size=1 setting), a momentum of 0.9, and a learning rate of 0.00025/0.001 for ResNet/ViT. For our ReCAP, L0 and τRE in Eq. 9 are set to 0.7/1.0 ln C and 0.8/1.0 ln C (C is the number of classes) for ResNet/ViT. The hyper-parameter τ in Eq. 4 is 1.2 and λ in Eq. 9 is 0.5 by default. For trainable parameters, following common practice (Wang et al., 2020), we adapt the affine parameters of normalization layers. |
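The thresholds quoted in the Experiment Setup row are expressed as multiples of ln C, the entropy of a uniform C-way prediction. A minimal sketch of how these values could be computed (the function name, signature, and backbone labels are illustrative, not taken from the paper):

```python
import math

def recap_thresholds(num_classes: int, backbone: str = "resnet"):
    """Return (L0, tau_RE) from Eq. 9 as multiples of ln C.

    Multipliers follow the settings stated in the paper:
    0.7 ln C / 0.8 ln C for ResNet, 1.0 ln C / 1.0 ln C for ViT.
    This helper is a hypothetical sketch, not the authors' code.
    """
    max_entropy = math.log(num_classes)  # ln C
    if backbone == "resnet":
        return 0.7 * max_entropy, 0.8 * max_entropy
    return 1.0 * max_entropy, 1.0 * max_entropy

# ImageNet-C has C = 1000 classes
l0, tau_re = recap_thresholds(1000, "resnet")
print(l0, tau_re)
```

For ImageNet-scale experiments (C = 1000), ln C ≈ 6.908, so the ResNet thresholds work out to roughly 4.84 and 5.53.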