Learning to Adapt Frozen CLIP for Few-Shot Test-Time Domain Adaptation
Authors: Zhixiang Chi, Li Gu, Huan Liu, Ziqiang Wang, Yanan Wu, Yang Wang, Konstantinos Plataniotis
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show our method's superiority on 5 large-scale benchmarks (WILDS and DomainNet), notably improving over smaller networks like ViT-B/16 with gains of +5.1 in F1 for iWildCam and +3.1% in WC Acc for FMoW. Our Code: L2C. We conduct ablation studies on DomainNet-Info, iWildCam and FMoW using CLIP ViT-B/16 on various components, including CPNet, Revert Attention (RT), text refinement (Text ref.), greedy ensemble (Greedy), uniformity loss (L_uni), the DAF module and training schemes in Table 3. |
| Researcher Affiliation | Academia | Zhixiang Chi 1, Li Gu 2, Huan Liu 3, Ziqiang Wang 2, Yanan Wu 2, Yang Wang 2, Konstantinos N. Plataniotis 1. 1 University of Toronto, 2 Concordia University, 3 McMaster University |
| Pseudocode | Yes | Algorithm 1 Domain-centric learning to adapt. Require: I/T: CLIP image/text encoders; {P_p}_{p=1}^{P}: P text prompt templates; C: C classes with names; D_s: source domains; α: learning rate; CP: CPNet; K/V: K-V domain cache; DAF: domain-aware fusion module; M_c/M_d: text refinement. 1: // Greedy text feature ensemble 2: {T(P_p(C))}_{p=1}^{P} ← compute and sort text features for all text prompt templates 3: Obtain T_gre via greedy ensemble using Eq. 3, then discard the text encoder. |
| Open Source Code | No | Pre-trained models and the full code will be released upon publication of the paper. |
| Open Datasets | Yes | We follow VDPG to evaluate on DomainNet (Peng et al., 2019), which comprises 569K images across 345 classes in 6 domains. We also evaluate on 4 WILDS (Koh et al., 2021) benchmarks, known for their real-world challenges and notably low CLIP zero-shot accuracy (Chi et al., 2024). |
| Dataset Splits | Yes | We follow the official leave-one-domain-out protocol to train 6 models and report accuracy. Each iteration is treated as an adaptation task on a randomly sampled source domain D^n_s. Two disjoint sets, a support set (x^S) and a query set (x^Q, y^Q), are sampled. |
| Hardware Specification | Yes | All the experiments can be conducted with a single NVIDIA V100 GPU. |
| Software Dependencies | No | Other components, such as CPNet, DAF, text refinement, and K-V cache, utilize standard PyTorch functions. |
| Experiment Setup | Yes | The model is trained for 20 epochs with SGD using cosine decay, with initial learning rates of 2.5e-3 and 1e-3 for WILDS and DomainNet, respectively. λ is set to 0.1 to balance the losses. We use 16 images for adaptation at inference. Appendices G & H list additional hyperparameters and the text prompts. We set the batch size to 64 (12 images for the support set and 52 images for the query set). |
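The greedy text-feature ensemble in the Algorithm 1 excerpt can be sketched as follows. This is a hedged reconstruction, not the paper's implementation: Eq. 3 is not quoted above, so we assume templates are pre-sorted by individual score and a template is kept only if adding it improves a validation score. The names `greedy_ensemble` and `score_fn` are illustrative.

```python
import numpy as np

def greedy_ensemble(template_feats, score_fn):
    """Greedily ensemble per-template CLIP text features.

    template_feats: list of (C, d) arrays, one per prompt template,
                    assumed pre-sorted by individual score (descending).
    score_fn:       callable mapping a (C, d) feature matrix to a
                    scalar validation score (higher is better).
    Returns an L2-normalized (C, d) ensembled text-feature matrix.
    """
    selected = [template_feats[0]]
    best = score_fn(template_feats[0])
    for feats in template_feats[1:]:
        candidate = np.mean(selected + [feats], axis=0)
        # re-normalize class-wise, since CLIP text features are unit norm
        candidate /= np.linalg.norm(candidate, axis=-1, keepdims=True)
        score = score_fn(candidate)
        if score > best:  # keep the template only if it helps
            selected.append(feats)
            best = score
    ensembled = np.mean(selected, axis=0)
    return ensembled / np.linalg.norm(ensembled, axis=-1, keepdims=True)
```

In practice `score_fn` would be zero-shot accuracy on a small held-out set; once `T_gre` is computed, the text encoder can be discarded as the excerpt states.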
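The per-epoch learning rate implied by the experiment-setup row can be computed as below. This is a minimal sketch assuming standard cosine annealing from the stated initial rate to zero over the 20 epochs; the paper may use warmup or a nonzero floor, which the excerpt does not specify.

```python
import math

def cosine_lr(epoch, total_epochs=20, base_lr=2.5e-3):
    """Cosine-decayed learning rate for a given (0-indexed) epoch.

    base_lr=2.5e-3 matches the stated WILDS setting; DomainNet
    would use base_lr=1e-3.
    """
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))
```

For example, `cosine_lr(0)` returns the full initial rate and the schedule decays smoothly toward zero at epoch 20.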