Noisy Test-Time Adaptation in Vision-Language Models
Authors: Chentao Cao, Zhun Zhong, (Andrew) Zhanke Zhou, Tongliang Liu, Yang Liu, Kun Zhang, Bo Han
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct comprehensive experiments to analyze this phenomenon, revealing that the negative impact of unfiltered noisy data outweighs the benefits of clean data during model updating. Extensive experiments show that our method outperforms in both ZS-NTTA and ZS-OOD detection. On ImageNet, AdaND achieves a notable improvement of 8.32% in harmonic mean accuracy (Acc_H) for ZS-NTTA and 9.40% in FPR95 for ZS-OOD detection, compared to state-of-the-art methods. |
| Researcher Affiliation | Academia | ¹TMLR Group, Department of Computer Science, Hong Kong Baptist University; ²School of Computer Science and Information Engineering, Hefei University of Technology; ³Sydney AI Centre, The University of Sydney; ⁴Computer Science and Engineering, University of California, Santa Cruz; ⁵Mohamed bin Zayed University of Artificial Intelligence; ⁶Carnegie Mellon University. Correspondence to Bo Han (EMAIL) and Zhun Zhong (EMAIL). |
| Pseudocode | Yes | REPRODUCIBILITY STATEMENT We provide details to reproduce our results in Sec. 2, Sec. 5.1, and Sec. D. We also provide pseudo-code in Algorithm 1, and the code is publicly available at: https://github.com/tmlr-group/ZS-NTTA. |
| Open Source Code | Yes | The code is publicly available at: https://github.com/tmlr-group/ZS-NTTA. |
| Open Datasets | Yes | The ID datasets include CIFAR-10/100 (Krizhevsky et al., 2009), CUB-200-2011 (Wah et al., 2011), Stanford Cars (Krause et al., 2013), Food-101 (Bossard et al., 2014), Oxford-IIIT Pet (Parkhi et al., 2012), ImageNet (Deng et al., 2009), ImageNet-V2 (Recht et al., 2019), ImageNet-A (Hendrycks et al., 2021b), ImageNet-R (Hendrycks et al., 2021a), and ImageNet-Sketch (Wang et al., 2019). The OOD datasets encompass SVHN (Netzer et al., 2011), LSUN (Yu et al., 2015), iNaturalist (Van Horn et al., 2018), SUN (Xiao et al., 2010), Places (Zhou et al., 2017), and Texture (Cimpoi et al., 2014). |
| Dataset Splits | Yes | To prevent overlap in label spaces of noisy and clean samples, we use established ID-OOD dataset pairs from standard OOD detection benchmarks. The ID datasets include CIFAR-10/100 (Krizhevsky et al., 2009), CUB-200-2011 (Wah et al., 2011), Stanford Cars (Krause et al., 2013), Food-101 (Bossard et al., 2014), Oxford-IIIT Pet (Parkhi et al., 2012), ImageNet (Deng et al., 2009), ImageNet-V2 (Recht et al., 2019), ImageNet-A (Hendrycks et al., 2021b), ImageNet-R (Hendrycks et al., 2021a), and ImageNet-Sketch (Wang et al., 2019). The OOD datasets encompass SVHN (Netzer et al., 2011), LSUN (Yu et al., 2015), iNaturalist (Van Horn et al., 2018), SUN (Xiao et al., 2010), Places (Zhou et al., 2017), and Texture (Cimpoi et al., 2014). The specific ID-OOD pairs are detailed in Table 8 in Appendix D.1. |
| Hardware Specification | Yes | D.3 Environment The experiments presented in this paper are conducted utilizing PyTorch 1.13 (Paszke et al., 2019) and Python 3.10.8 within an Ubuntu 22.04 LTS environment, running on NVIDIA A100 80GB PCIe GPUs and AMD EPYC 7H12 CPUs. |
| Software Dependencies | Yes | D.3 Environment The experiments presented in this paper are conducted utilizing PyTorch 1.13 (Paszke et al., 2019) and Python 3.10.8 within an Ubuntu 22.04 LTS environment, running on NVIDIA A100 80GB PCIe GPUs and AMD EPYC 7H12 CPUs. |
| Experiment Setup | Yes | AdaND Setups. In our main results, we maintain consistent hyper-parameters across all datasets. Specifically, we use CLIP (Radford et al., 2021) as our VLM, with ViT-B/16 (Dosovitskiy et al., 2020) as the image encoder and a masked self-attention Transformer (Vaswani et al., 2017) as the text encoder, both kept frozen. We employ a single linear layer as our noise detector, which remains learnable throughout the TTA process. We optimize with Adam (Kingma & Ba, 2014), using a learning rate of 0.0005 and no weight decay. Gaussian noise is injected every 8 samples (M = 8). The noise detector's queue length (L) is set to 128, and the adaptive threshold's queue length (Nq) follows OWTTT (Li et al., 2023b) with a value of 512. We use N = 10 for the first stage. As for the ZS-OOD detection task, we use the MCM (Ming et al., 2022) score from the output logit of the noise detector as our score function. Unless otherwise specified, we set the batch size (bs) to 1 for AdaND. |
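The setup described above can be sketched in code. This is a minimal, hypothetical illustration only: CLIP ViT-B/16 image features (dimension 512) are replaced by random tensors, and the clean/noisy pseudo-labeling is a trivial stand-in (the actual method derives clean pseudo-labels from CLIP zero-shot scores filtered by an adaptive threshold, per Algorithm 1 in the paper). The learnable linear detector, Adam with lr = 0.0005 and no weight decay, Gaussian-noise injection every M = 8 samples, and batch size 1 follow the quoted setup.

```python
import torch
import torch.nn as nn

# Hyper-parameters taken from the quoted AdaND setup.
FEAT_DIM = 512   # CLIP ViT-B/16 image-embedding dimension
M = 8            # inject Gaussian noise every M samples
LR = 5e-4        # Adam learning rate, no weight decay

# A single learnable linear layer serves as the noise detector;
# the CLIP encoders themselves would stay frozen.
detector = nn.Linear(FEAT_DIM, 2)
optimizer = torch.optim.Adam(detector.parameters(), lr=LR, weight_decay=0.0)

def adapt_step(feat: torch.Tensor, step: int) -> torch.Tensor:
    """One test-time adaptation step with batch size 1. Every M-th sample
    is replaced by injected Gaussian noise, giving a synthetic 'noisy'
    training signal; other samples use a placeholder 'clean' label here."""
    if step % M == 0:
        feat = torch.randn_like(feat)   # injected Gaussian noise
        target = torch.tensor([1])      # pseudo-label: noisy
    else:
        target = torch.tensor([0])      # placeholder pseudo-label: clean
    logits = detector(feat.unsqueeze(0))
    loss = nn.functional.cross_entropy(logits, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()

torch.manual_seed(0)
losses = [adapt_step(torch.randn(FEAT_DIM), t) for t in range(32)]
```

With random features the detector cannot learn anything meaningful; the sketch only shows the update loop structure (online, batch size 1, periodic noise injection) implied by the reported hyper-parameters.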