Learning to Adapt Frozen CLIP for Few-Shot Test-Time Domain Adaptation

Authors: Zhixiang Chi, Li Gu, Huan Liu, Ziqiang Wang, Yanan Wu, Yang Wang, Konstantinos Plataniotis

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show our method's superiority on 5 large-scale benchmarks (WILDS and DomainNet), notably improving over smaller networks like ViT-B/16 with gains of +5.1 in F1 for iWildCam and +3.1% in WC Acc for FMoW. Our code: L2C. We conduct ablation studies on DomainNet-Info, iWildCam, and FMoW using CLIP ViT-B/16 on various components, including CPNet, Revert Attention (RT), text refinement (Text ref.), greedy ensemble (Greedy), uniformity loss (L_uni), the DAF module, and training schemes in Table 3.
Researcher Affiliation | Academia | Zhixiang Chi (1), Li Gu (2), Huan Liu (3), Ziqiang Wang (2), Yanan Wu (2), Yang Wang (2), Konstantinos N. Plataniotis (1); 1: University of Toronto, 2: Concordia University, 3: McMaster University
Pseudocode | Yes |
Algorithm 1: Domain-centric learning to adapt
Require: I/T: CLIP image/text encoders; {P_p}_{p=1}^P: P text prompt templates; C: C classes with names; D_s: source domains; α: learning rate; CP: CPNet; K/V: K-V domain cache; DAF: domain-aware fusion module; M_c/M_d: text refinement
1: // Greedy text feature ensemble
2: {T(P_p, C)}_{p=1}^P ← compute and sort text features for all text prompt templates
3: Obtain T_gre via greedy ensemble using Eq. 3, then discard the text encoder.
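The greedy text-feature ensemble in steps 2-3 can be sketched as below. This is a minimal illustration, not the authors' code: the function name, the scoring callback, and the NumPy setting are assumptions, and the exact selection criterion of Eq. 3 lives in the paper.

```python
import numpy as np


def greedy_text_ensemble(template_feats, score_fn):
    """Hypothetical sketch of a greedy prompt-template ensemble.

    template_feats: list of (C, d) arrays, one per prompt template,
                    pre-sorted by individual score (best first).
    score_fn: callable mapping a (C, d) text-feature matrix to a scalar
              (stands in for the criterion of Eq. 3 in the paper).
    Returns the ensembled, row-normalized (C, d) text features T_gre.
    """
    selected = [template_feats[0]]
    best = score_fn(template_feats[0])
    for feats in template_feats[1:]:
        # Tentatively average the candidate template into the ensemble.
        candidate = np.mean(selected + [feats], axis=0)
        # Re-normalize rows, since CLIP text features are unit-norm.
        candidate /= np.linalg.norm(candidate, axis=1, keepdims=True)
        score = score_fn(candidate)
        if score > best:  # keep the template only if it improves the score
            selected.append(feats)
            best = score
    out = np.mean(selected, axis=0)
    return out / np.linalg.norm(out, axis=1, keepdims=True)
```

Once T_gre is computed, the text encoder can be discarded (step 3), which is what makes the scheme attractive for deployment.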
Open Source Code | No | Pre-trained models and the full code will be released upon publication of the paper.
Open Datasets | Yes | We follow VDPG to evaluate on DomainNet (Peng et al., 2019), which comprises 569K images across 345 classes in 6 domains. We also evaluate on 4 WILDS (Koh et al., 2021) benchmarks, known for their real-world challenges and notably low CLIP zero-shot accuracy (Chi et al., 2024).
Dataset Splits | Yes | We follow the official leave-one-domain-out protocol to train 6 models and report accuracy. Each iteration is treated as an adaptation task on a randomly sampled source domain D_s^n. Two disjoint sets are sampled: a support set (x^S) and a query set (x^Q, y^Q).
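The per-iteration task sampling described above can be sketched as follows. The function name and data layout are illustrative assumptions; the 12-support/52-query split matches the batch composition reported in the experiment setup.

```python
import random


def sample_adaptation_task(domain_data, n_support=12, n_query=52, seed=None):
    """Hypothetical sketch of sampling one adaptation task.

    domain_data: dict mapping domain name -> list of (image, label) pairs.
    Draws one source domain at random, then disjoint support and query
    sets from it (support is used unlabeled; query carries labels).
    """
    rng = random.Random(seed)
    domain = rng.choice(sorted(domain_data))      # random source domain D_s^n
    pool = domain_data[domain][:]
    rng.shuffle(pool)
    support = pool[:n_support]                    # x^S
    query = pool[n_support:n_support + n_query]   # (x^Q, y^Q), disjoint by slicing
    return domain, support, query
```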
Hardware Specification | Yes | All the experiments can be conducted with a single NVIDIA V100 GPU.
Software Dependencies | No | Other components, such as CPNet, DAF, text refinement, and the K-V cache, use standard PyTorch functions.
Experiment Setup | Yes | The model is trained for 20 epochs with SGD using cosine decay, with initial learning rates of 2.5e-3 and 1e-3 for WILDS and DomainNet, respectively. λ is set to 0.1 to balance the losses. We use 16 images for adaptation at inference. Appendices G and H list additional hyperparameters and the text prompts. We set the batch size to 64 (12 images for the support set and 52 images for the query set).
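The cosine-decay schedule named above is a standard recipe; a minimal sketch (not the authors' training code) of the per-step learning rate, using the reported base rates:

```python
import math


def cosine_lr(step, total_steps, base_lr):
    """Standard cosine learning-rate decay from base_lr down to 0.

    base_lr is 2.5e-3 for WILDS and 1e-3 for DomainNet, per the setup
    above; the decay horizon (total_steps) is an assumption here.
    """
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))
```

For example, with `base_lr=2.5e-3` the rate starts at 2.5e-3, passes through half that value at the schedule midpoint, and reaches 0 at the final step.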