Cross-modal Collaborative Representation Learning for Text-to-Image Person Retrieval

Authors: Shuanglin Yan, Jun Liu, Neng Dong, Liyan Zhang, Jinhui Tang

IJCAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments on multiple benchmarks demonstrate the superiority of CoRL over existing TIPR methods.
Researcher Affiliation Academia 1 Nanjing University of Science and Technology, 2 Lancaster University, 3 Nanjing University of Aeronautics and Astronautics
Pseudocode No The paper includes figures illustrating the framework (e.g., Figure 1 and Figure 2) but does not contain any explicit pseudocode blocks or algorithms.
Open Source Code No The paper does not contain any explicit statements about releasing source code, nor does it provide links to any code repositories.
Open Datasets Yes The evaluations are conducted on three TIPR datasets. CUHK-PEDES [Li et al., 2017] has 40,206 images and 80,412 descriptions of 13,003 people. ICFG-PEDES [Ding et al., 2021] consists of 54,522 image-text pairs of 4,102 persons... RSTPReid [Zhu et al., 2021] includes 20,505 images of 4,101 people...
Dataset Splits Yes CUHK-PEDES [Li et al., 2017]... The dataset is split into 34,054 images for training, 3,078 for validation, and 3,074 for testing. ICFG-PEDES [Ding et al., 2021]... Training uses 34,674 pairs from 3,102 people, with the remaining 1,000 people reserved for evaluation. RSTPReid [Zhu et al., 2021]... Training includes 3,701 people, while validation and testing include 200 people each.
Hardware Specification Yes Experiments are implemented using the PyTorch library on a single NVIDIA RTX 3090 (24GB) GPU.
Software Dependencies No The paper mentions the "PyTorch library" and "CLIP-ViT-B/16 as the backbone" but does not specify version numbers for these or any other software components.
Experiment Setup Yes Images are resized to 384×128 and augmented with random horizontal flipping, cropping with padding, and random erasing. The maximum length of the text sequence is set to 77, and random masking is employed for text augmentation. We use CLIP-ViT-B/16 as the backbone. Temperature factors are set to τa = 0.02, τsp = 10, τwp = 5, and τn = 40. Loss weight λ1 is 0.1, and the boundaries α and β in IBM loss are 0.6 and 0.4. Each mini-batch comprises B = P × K images, with P = 32 identities and K = 4 images per identity. In the first stage, only a fully connected layer is optimized for 60 epochs using a cosine learning rate schedule, starting at 1×10^-4. In the second stage, we fine-tune the visual/textual backbones with an initial learning rate of 1×10^-5 and the Adapter with 5×10^-5, also using a cosine schedule and trained for 60 epochs. Both stages adopt the Adam optimizer with a linear warm-up over the first 5 epochs.
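The learning-rate schedule quoted above (linear warm-up over the first 5 epochs, then cosine decay over 60 epochs, with per-module base rates in the second stage) can be sketched as a small helper. This is an illustrative reconstruction, not code from the paper; the function name and the exact decay formula are assumptions, since the paper only names the schedule.

```python
import math

def lr_at_epoch(epoch, base_lr, total_epochs=60, warmup_epochs=5):
    """Illustrative learning rate for a given epoch under linear warm-up
    followed by cosine decay, as described in the experiment setup.
    (Hypothetical helper; the paper does not give an explicit formula.)"""
    if epoch < warmup_epochs:
        # Linear warm-up from ~0 up to base_lr over the first warmup_epochs.
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine decay from base_lr toward 0 over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))

# Second-stage per-module base rates quoted in the setup:
backbone_lr = 1e-5   # visual/textual backbones
adapter_lr = 5e-5    # Adapter
```

In a PyTorch training loop, the two base rates would typically be realized as separate optimizer parameter groups, with this schedule applied multiplicatively to each group.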