Noisy Test-Time Adaptation in Vision-Language Models
Authors: Chentao Cao, Zhun Zhong, (Andrew) Zhanke Zhou, Tongliang Liu, Yang Liu, Kun Zhang, Bo Han
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct comprehensive experiments to analyze this phenomenon, revealing that the negative impact of unfiltered noisy data outweighs the benefits of clean data during model updating. Extensive experiments show that our method outperforms in both ZS-NTTA and ZS-OOD detection. On ImageNet, AdaND achieves a notable improvement of 8.32% in harmonic mean accuracy (Acc_H) for ZS-NTTA and 9.40% in FPR95 for ZS-OOD detection, compared to state-of-the-art methods. |
| Researcher Affiliation | Academia | ¹TMLR Group, Department of Computer Science, Hong Kong Baptist University; ²School of Computer Science and Information Engineering, Hefei University of Technology; ³Sydney AI Centre, The University of Sydney; ⁴Computer Science and Engineering, University of California, Santa Cruz; ⁵Mohamed bin Zayed University of Artificial Intelligence; ⁶Carnegie Mellon University. Correspondence to Bo Han (EMAIL) and Zhun Zhong (EMAIL). |
| Pseudocode | Yes | REPRODUCIBILITY STATEMENT We provide details to reproduce our results in Sec. 2, Sec. 5.1, and Sec. D. We also provide pseudo-code in Algorithm 1, and the code is publicly available at: https://github.com/tmlr-group/ZS-NTTA. |
| Open Source Code | Yes | The code is publicly available at: https://github.com/tmlr-group/ZS-NTTA. |
| Open Datasets | Yes | The ID datasets include CIFAR-10/100 (Krizhevsky et al., 2009), CUB-200-2011 (Wah et al., 2011), Stanford Cars (Krause et al., 2013), Food-101 (Bossard et al., 2014), Oxford-IIIT Pet (Parkhi et al., 2012), ImageNet (Deng et al., 2009), ImageNet-V2 (Recht et al., 2019), ImageNet-A (Hendrycks et al., 2021b), ImageNet-R (Hendrycks et al., 2021a), and ImageNet-Sketch (Wang et al., 2019). The OOD datasets encompass SVHN (Netzer et al., 2011), LSUN (Yu et al., 2015), iNaturalist (Van Horn et al., 2018), SUN (Xiao et al., 2010), Places (Zhou et al., 2017), and Texture (Cimpoi et al., 2014). |
| Dataset Splits | Yes | To prevent overlap in label spaces of noisy and clean samples, we use established ID-OOD dataset pairs from standard OOD detection benchmarks. The ID datasets include CIFAR-10/100 (Krizhevsky et al., 2009), CUB-200-2011 (Wah et al., 2011), Stanford Cars (Krause et al., 2013), Food-101 (Bossard et al., 2014), Oxford-IIIT Pet (Parkhi et al., 2012), ImageNet (Deng et al., 2009), ImageNet-V2 (Recht et al., 2019), ImageNet-A (Hendrycks et al., 2021b), ImageNet-R (Hendrycks et al., 2021a), and ImageNet-Sketch (Wang et al., 2019). The OOD datasets encompass SVHN (Netzer et al., 2011), LSUN (Yu et al., 2015), iNaturalist (Van Horn et al., 2018), SUN (Xiao et al., 2010), Places (Zhou et al., 2017), and Texture (Cimpoi et al., 2014). The specific ID-OOD pairs are detailed in Table 8 in Appendix D.1. |
| Hardware Specification | Yes | D.3 Environment The experiments presented in this paper are conducted utilizing PyTorch 1.13 (Paszke et al., 2019) and Python 3.10.8 within an Ubuntu 22.04 LTS environment, running on NVIDIA A100 80GB PCIe GPUs and AMD EPYC 7H12 CPUs. |
| Software Dependencies | Yes | D.3 Environment The experiments presented in this paper are conducted utilizing PyTorch 1.13 (Paszke et al., 2019) and Python 3.10.8 within an Ubuntu 22.04 LTS environment, running on NVIDIA A100 80GB PCIe GPUs and AMD EPYC 7H12 CPUs. |
| Experiment Setup | Yes | AdaND Setups. In our main results, we maintain consistent hyper-parameters across all datasets. Specifically, we use CLIP (Radford et al., 2021) as our VLM, with ViT-B/16 (Dosovitskiy et al., 2020) as the image encoder and a masked self-attention Transformer (Vaswani et al., 2017) as the text encoder, both kept frozen. We employ a single linear layer as our noise detector, which remains learnable throughout the TTA process. We optimize with Adam (Kingma & Ba, 2014), using a learning rate of 0.0005 and no weight decay. Gaussian noise is injected every 8 samples (M = 8). The noise detector's queue length (L) is set to 128, and the adaptive threshold's queue length (Nq) follows OWTTT (Li et al., 2023b) with a value of 512. We use N = 10 for the first stage. As for the ZS-OOD detection task, we use the MCM (Ming et al., 2022) score from the output logit of the noise detector as our score function. Unless otherwise specified, we set the batch size (bs) to 1 for AdaND. |
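The setup described above can be sketched in code. This is a minimal, hypothetical illustration only: CLIP ViT-B/16 image features (dimension 512) are replaced by random tensors, and the clean/noisy pseudo-labeling is a trivial stand-in (the actual method derives clean pseudo-labels from CLIP zero-shot scores filtered by an adaptive threshold, per Algorithm 1 in the paper). The learnable linear detector, Adam with lr = 0.0005 and no weight decay, Gaussian-noise injection every M = 8 samples, and batch size 1 follow the quoted setup.

```python
import torch
import torch.nn as nn

# Hyper-parameters taken from the quoted AdaND setup.
FEAT_DIM = 512   # CLIP ViT-B/16 image-embedding dimension
M = 8            # inject Gaussian noise every M samples
LR = 5e-4        # Adam learning rate, no weight decay

# A single learnable linear layer serves as the noise detector;
# the CLIP encoders themselves would stay frozen.
detector = nn.Linear(FEAT_DIM, 2)
optimizer = torch.optim.Adam(detector.parameters(), lr=LR, weight_decay=0.0)

def adapt_step(feat: torch.Tensor, step: int) -> torch.Tensor:
    """One test-time adaptation step with batch size 1. Every M-th sample
    is replaced by injected Gaussian noise, giving a synthetic 'noisy'
    training signal; other samples use a placeholder 'clean' label here."""
    if step % M == 0:
        feat = torch.randn_like(feat)   # injected Gaussian noise
        target = torch.tensor([1])      # pseudo-label: noisy
    else:
        target = torch.tensor([0])      # placeholder pseudo-label: clean
    logits = detector(feat.unsqueeze(0))
    loss = nn.functional.cross_entropy(logits, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()

torch.manual_seed(0)
losses = [adapt_step(torch.randn(FEAT_DIM), t) for t in range(32)]
```

With random features the detector cannot learn anything meaningful; the sketch only shows the update loop structure (online, batch size 1, periodic noise injection) implied by the reported hyper-parameters.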