Test-time Adaptation for Cross-modal Retrieval with Query Shift

Authors: Haobin Li, Peng Hu, Qianjun Zhang, Xi Peng, Xiting Liu, Mouxing Yang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate the effectiveness of the proposed TCR against query shift. Code is available at https://github.com/XLearning-SCU/2025-ICLR-TCR. Extensive experiments verify the effectiveness of the proposed method. Furthermore, we benchmark the existing TTA methods on cross-modal retrieval with query shift across six widely-used image-text datasets, hoping to facilitate the study of test-time adaptation beyond unimodal tasks.
Researcher Affiliation | Academia | 1. College of Computer Science, Sichuan University, China; 2. Southwest Jiaotong University, China; 3. State Key Laboratory of Hydraulics and Mountain River Engineering, Sichuan University, China; 4. Georgia Institute of Technology, USA.
Pseudocode | Yes | B.4 PSEUDO CODE: In the following, we provide the pseudo-code of the proposed TCR in Algorithm 1. To guarantee the stability of the estimation for Em and S, we maintain a queue which always saves the query-candidate pairs with the smallest SI during the adaptation process. Following Caron et al. (2020), we limit the queue updating times to a maximum of 10 iterations. (Algorithm 1: Test-time adaptation for Cross-modal Retrieval (TCR))
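The queue maintenance quoted above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `CandidateQueue` and the representation of an SI score as a plain float are our assumptions; only the "smallest SI" selection rule and the cap of 10 updates come from the paper.

```python
import heapq
from itertools import count

class CandidateQueue:
    """Fixed-size queue keeping the query-candidate pairs with the
    smallest SI scores seen so far (hypothetical sketch)."""

    def __init__(self, capacity, max_updates=10):
        self.capacity = capacity
        self.max_updates = max_updates  # the paper caps queue updates at 10 iterations
        self.updates = 0
        self._counter = count()  # tie-breaker so the heap never compares payloads
        self._heap = []          # max-heap on SI via negation: worst kept pair on top

    def update(self, pairs):
        """pairs: iterable of (si_score, query, candidate) for one mini-batch."""
        if self.updates >= self.max_updates:
            return  # queue is frozen after max_updates iterations
        for si, query, candidate in pairs:
            entry = (-si, next(self._counter), query, candidate)
            if len(self._heap) < self.capacity:
                heapq.heappush(self._heap, entry)
            elif entry[0] > self._heap[0][0]:  # strictly smaller SI than current worst
                heapq.heapreplace(self._heap, entry)
        self.updates += 1

    def pairs(self):
        """Return stored (si, query, candidate) triples, smallest SI first."""
        return sorted((-neg, q, c) for neg, _, q, c in self._heap)
```

Negating the score turns Python's min-heap into a max-heap, so the pair with the largest SI is always the one evicted when a better (smaller-SI) pair arrives.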
Open Source Code | Yes | Code is available at https://github.com/XLearning-SCU/2025-ICLR-TCR.
Open Datasets | Yes | Following Qiu et al. (2023), we introduce 16 types of corruptions to the image modality and 15 types to the text modality across widely-used image-text retrieval datasets, COCO (Lin et al., 2014) and Flickr (Plummer et al., 2015). Fashion-Gen (Rostamzadeh et al., 2018) from the e-commerce domain, CUHK-PEDES (Li et al., 2017) and ICFG-PEDES (Ding et al., 2021) from the person re-identification (Re-ID) domain, and COCO, Flickr, and Nocaps (Agrawal et al., 2019) from the natural image domain. We conduct additional experiments in the even rarer remote sensing domain. To this end, we choose BLIP as the source model and perform zero-shot retrieval on the remote sensing datasets RSICD (Lu et al., 2017) and RSITMD (Yuan et al., 2022).
Dataset Splits | Yes | Following Lee et al. (2018), we adopt two testing protocols, namely, image-to-text retrieval (a.k.a. TR) and text-to-image retrieval (a.k.a. IR). COCO is a large-scale dataset for cross-modal retrieval and image captioning tasks. For evaluation, we conduct experiments on the COCO 2014 testing set following Li et al. (2022), which contains 5,000 images and 25,000 annotations, with each image associated with five corresponding text descriptions. Flickr is a cross-modal retrieval dataset collected from natural scenarios. Following Radford et al. (2021), we employ the test set comprising 1,000 images and 5,000 annotations, where each image is paired with five corresponding sentences. Nocaps is a cross-modal retrieval dataset derived from the Open Images dataset. For evaluation, we perform experiments on the test set, which consists of 648 in-domain images, 2,938 near-domain images, and 914 out-domain images. Each image is paired with 10 captions.
Hardware Specification | No | No specific hardware details (GPU/CPU models, memory) are provided in the main text.
Software Dependencies | No | The paper mentions the use of the AdamW optimizer, but does not provide specific version numbers for software libraries or environments to allow for reproducibility.
Experiment Setup | Yes | During the adaptation process, TCR performs the objective function for each coming mini-batch of queries, and the batch size is set as 64. Following Niu et al. (2023) and Wang et al. (2021), TCR updates the parameters within the normalization layers in the query-specific encoder f_{Θ_Q^s} using the AdamW optimizer. To be more specific, the learnable parameters in Θ (Eq. 3) correspond to the Layer Normalization (LN) layers in our implementation. Besides, the temperature hyper-parameter τ in Eq. 1 and the uniformity learning hyper-parameter t in Eq. 9 are fixed as 0.02 and 10 for all experiments, respectively. Moreover, the adaptation process uses an initial learning rate of 3e-4/3e-5 for text/image retrieval, except for 3e-4 for image retrieval on the CLIP model.
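The parameter selection quoted above can be sketched as follows: a minimal PyTorch illustration in which everything is frozen except the Layer Normalization parameters, which are then handed to AdamW. The toy encoder and the helper name `collect_ln_params` are our assumptions; the LN-only updates, the optimizer choice, and the hyper-parameter values (τ = 0.02, t = 10, lr = 3e-4/3e-5) follow the quoted setup.

```python
# Hypothetical sketch of the quoted setup: freeze the encoder, re-enable only
# the LayerNorm affine parameters, and optimize them with AdamW.
import torch
import torch.nn as nn

def collect_ln_params(model):
    """Re-enable gradients for LayerNorm parameters only and return them."""
    params = []
    for module in model.modules():
        if isinstance(module, nn.LayerNorm):
            for p in module.parameters():
                p.requires_grad = True
                params.append(p)
    return params

# Toy stand-in for the query-specific encoder f_{Θ_Q^s}.
encoder = nn.Sequential(
    nn.Linear(16, 16), nn.LayerNorm(16),
    nn.ReLU(),
    nn.Linear(16, 16), nn.LayerNorm(16),
)
for p in encoder.parameters():
    p.requires_grad = False  # freeze everything first

ln_params = collect_ln_params(encoder)
optimizer = torch.optim.AdamW(ln_params, lr=3e-4)  # 3e-4 for text retrieval; 3e-5 for image retrieval

tau = 0.02  # temperature τ in Eq. 1, fixed for all experiments
t = 10      # uniformity learning hyper-parameter t in Eq. 9
```

Restricting updates to normalization-layer parameters is a common test-time-adaptation choice (cf. the cited Niu et al. and Wang et al. protocols): it keeps the adaptation lightweight and less prone to forgetting than fine-tuning the full encoder.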