GS-Bias: Global-Spatial Bias Learner for Single-Image Test-Time Adaptation of Vision-Language Models

Authors: Zhaohong Huang, Yuxin Zhang, Jingjing Xie, Fei Chao, Rongrong Ji

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our code is released at https://github.com/hzhxmu/GS-Bias. [...] We report comprehensive evaluations and comparisons to the existing TTA method for VLMs across 15 datasets, demonstrating that GS-Bias achieves an excellent balance between performance and efficiency.
Researcher Affiliation | Academia | 1Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China. Correspondence to: Rongrong Ji <EMAIL>.
Pseudocode | No | The paper describes methods and equations but does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is released at https://github.com/hzhxmu/GS-Bias.
Open Datasets | Yes | For the Cross-Datasets Generalization benchmark, we assess performance on 10 diverse classification datasets, covering a broad spectrum of visual recognition tasks. These include datasets for plant and animal species (Flowers102 (Nilsback & Zisserman, 2008) and Oxford Pets (Parkhi et al., 2012)), transportation (Stanford Cars (Krause et al., 2013) and FGVC-Aircraft (Maji et al., 2013)), food (Food101 (Bossard et al., 2014)), satellite imagery (EuroSAT (Helber et al., 2019)), human actions (UCF101 (Soomro, 2012)), texture (DTD (Cimpoi et al., 2014)), scene recognition (SUN397 (Sun et al., 2020)), and general object classification (Caltech101 (Fei-Fei et al., 2004)). In the Domain Generalization setting, we evaluate our approach on four out-of-distribution (OOD) variants of ImageNet (Deng et al., 2009): ImageNet V2 (Recht et al., 2019), ImageNet-Sketch (Wang et al., 2019), ImageNet-A (Hendrycks et al., 2021b), and ImageNet-R (Hendrycks et al., 2021a).
Dataset Splits | No | The paper mentions using standard benchmark datasets for evaluation and provides details on how augmented views are generated for test-time adaptation (e.g., BS = 8, ρ = 0.5), but it does not specify explicit training/validation/test splits of the datasets themselves as would be needed for reproducing model training.
Hardware Specification | Yes | All experiments are conducted on a single NVIDIA A800 GPU.
Software Dependencies | No | The paper mentions using CLIP, ViT, and a Transformer-based text encoder but does not provide specific version numbers for these or other software dependencies like Python, PyTorch, or CUDA.
Experiment Setup | Yes | Implementation details. In all experiments, we use the publicly available pre-trained CLIP model, with ViT-B/16 (Dosovitskiy, 2020) as the backbone and a Transformer-based text encoder (Vaswani, 2017). For each test image, we evaluate two hand-crafted prompts: the basic prompt "a photo of a [class]." and the more elaborate ensemble prompt described in (Radford et al., 2021a). [...] Aligned with prior TTA works (Shu et al., 2022; Zanella & Ben Ayed, 2024), we adopt random cropping as the data augmentation strategy. Specifically, for the unseen cross-dataset generalization, we obtain a set of augmented views with BS = 8 and set the views selection rate to ρ = 0.5. For the out-of-distribution domain generalization, the number of augmented views is increased to BS = 64, and the views selection rate is adjusted to ρ = 0.3. For the learning of GS-Bias, the number of important spatial regions K in Eq. 12 is fixed at 16. These biases are optimized with 5 steps during the test phase. The learning rates for the biases in Eq. 7 and Eq. 14 are set to α = 1 and β = 1 for cross-domain generalization, whereas for domain generalization, α = 10 and β = 1. Across all experiments, top-1 accuracy (%) is used as the evaluation metric, which is a standard measure for classification performance.
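The augmented-view pipeline quoted above (BS views per test image, selection rate ρ) can be sketched as follows. This is a minimal illustration, not the authors' code: entropy-based confidence filtering is an assumption borrowed from prior TTA work such as TPT (Shu et al., 2022), and all function names are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy of a class-probability vector (lower = more confident)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_views(view_probs, rho):
    """Keep the rho fraction of augmented views with the lowest prediction entropy.
    Assumption: 'selection rate' filters by confidence, as in TPT-style TTA."""
    k = max(1, int(round(rho * len(view_probs))))
    return sorted(view_probs, key=entropy)[:k]

def averaged_prediction(view_probs, rho):
    """Average class probabilities over the selected confident views."""
    kept = select_views(view_probs, rho)
    n_cls = len(kept[0])
    return [sum(v[c] for v in kept) / len(kept) for c in range(n_cls)]

# Toy example with BS = 8 views and rho = 0.5 (the cross-dataset setting):
# each row is a 3-class probability vector for one random-crop view.
views = [
    [0.90, 0.05, 0.05], [0.80, 0.10, 0.10], [0.34, 0.33, 0.33],
    [0.70, 0.20, 0.10], [0.95, 0.03, 0.02], [0.40, 0.30, 0.30],
    [0.85, 0.10, 0.05], [0.50, 0.25, 0.25],
]
avg = averaged_prediction(views, rho=0.5)  # 4 most confident views are kept
```

For the domain-generalization setting one would instead pass BS = 64 views with rho = 0.3; the selection logic is unchanged.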