GS-Bias: Global-Spatial Bias Learner for Single-Image Test-Time Adaptation of Vision-Language Models

Authors: Zhaohong Huang, Yuxin Zhang, Jingjing Xie, Fei Chao, Rongrong Ji

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our code is released at https://github.com/hzhxmu/GS-Bias. [...] We report comprehensive evaluations and comparisons to the existing TTA method for VLMs across 15 datasets, demonstrating that GS-Bias achieves an excellent balance between performance and efficiency.
Researcher Affiliation | Academia | 1Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China. Correspondence to: Rongrong Ji <EMAIL>.
Pseudocode | No | The paper describes methods and equations but does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is released at https://github.com/hzhxmu/GS-Bias.
Open Datasets | Yes | For the Cross-Datasets Generalization benchmark, we assess performance on 10 diverse classification datasets, covering a broad spectrum of visual recognition tasks. These include datasets for plant and animal species (Flowers102 (Nilsback & Zisserman, 2008) and Oxford Pets (Parkhi et al., 2012)), transportation (Stanford Cars (Krause et al., 2013) and FGVC-Aircraft (Maji et al., 2013)), food (Food101 (Bossard et al., 2014)), satellite imagery (EuroSAT (Helber et al., 2019)), human actions (UCF101 (Soomro, 2012)), texture (DTD (Cimpoi et al., 2014)), scene recognition (SUN397 (Sun et al., 2020)), and general object classification (Caltech101 (Fei-Fei et al., 2004)). In the Domain Generalization setting, we evaluate our approach on four out-of-distribution (OOD) variants of ImageNet (Deng et al., 2009): ImageNet V2 (Recht et al., 2019), ImageNet-Sketch (Wang et al., 2019), ImageNet-A (Hendrycks et al., 2021b), and ImageNet-R (Hendrycks et al., 2021a).
Dataset Splits | No | The paper mentions using standard benchmark datasets for evaluation and provides details on how augmented views are generated for test-time adaptation (e.g., BS = 8, ρ = 0.5), but it does not specify explicit training/validation/test splits of the datasets themselves as would be needed for reproducing model training.
Hardware Specification | Yes | All experiments are conducted on a single NVIDIA A800 GPU.
Software Dependencies | No | The paper mentions using CLIP, ViT, and a Transformer-based text encoder but does not provide specific version numbers for these or other software dependencies like Python, PyTorch, or CUDA.
Experiment Setup | Yes | Implementation details. In all experiments, we use the publicly available pre-trained CLIP model, with ViT-B/16 (Dosovitskiy, 2020) as the backbone and a Transformer-based text encoder (Vaswani, 2017). For each test image, we evaluate two hand-crafted prompts: the basic prompt "a photo of a [class]." and the more elaborate ensemble prompt described in (Radford et al., 2021a). [...] Aligned with prior TTA works (Shu et al., 2022; Zanella & Ben Ayed, 2024), we adopt random cropping as the data augmentation strategy. Specifically, for the unseen cross-dataset generalization, we obtain a set of augmented views with BS = 8 and set the views selection rate to ρ = 0.5. For the out-of-distribution domain generalization, the number of augmented views is increased to BS = 64, and the views selection rate is adjusted to ρ = 0.3. For the learning of GS-Bias, the number of important spatial regions K in Eq. 12 is fixed at 16. These biases are optimized with 5 steps during the test phase. The learning rates for the biases in Eq. 7 and Eq. 14 are set to α = 1 and β = 1 for cross-domain generalization, whereas for domain generalization, α = 10 and β = 1. Across all experiments, top-1 accuracy (%) is used as the evaluation metric, which is a standard measure for classification performance.
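The augmented-view pipeline quoted above (BS views per test image, selection rate ρ) can be sketched as follows. This is a minimal illustration, not the authors' code: entropy-based confidence filtering is an assumption borrowed from prior TTA work such as TPT (Shu et al., 2022), and all function names are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy of a class-probability vector (lower = more confident)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_views(view_probs, rho):
    """Keep the rho fraction of augmented views with the lowest prediction entropy.
    Assumption: 'selection rate' filters by confidence, as in TPT-style TTA."""
    k = max(1, int(round(rho * len(view_probs))))
    return sorted(view_probs, key=entropy)[:k]

def averaged_prediction(view_probs, rho):
    """Average class probabilities over the selected confident views."""
    kept = select_views(view_probs, rho)
    n_cls = len(kept[0])
    return [sum(v[c] for v in kept) / len(kept) for c in range(n_cls)]

# Toy example with BS = 8 views and rho = 0.5 (the cross-dataset setting):
# each row is a 3-class probability vector for one random-crop view.
views = [
    [0.90, 0.05, 0.05], [0.80, 0.10, 0.10], [0.34, 0.33, 0.33],
    [0.70, 0.20, 0.10], [0.95, 0.03, 0.02], [0.40, 0.30, 0.30],
    [0.85, 0.10, 0.05], [0.50, 0.25, 0.25],
]
avg = averaged_prediction(views, rho=0.5)  # 4 most confident views are kept
```

For the domain-generalization setting one would instead pass BS = 64 views with rho = 0.3; the selection logic is unchanged.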