Cross-modal Collaborative Representation Learning for Text-to-Image Person Retrieval

Authors: Shuanglin Yan, Jun Liu, Neng Dong, Liyan Zhang, Jinhui Tang

IJCAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments on multiple benchmarks demonstrate the superiority of CoRL over existing TIPR methods.
Researcher Affiliation Academia 1 Nanjing University of Science and Technology, 2 Lancaster University, 3 Nanjing University of Aeronautics and Astronautics
Pseudocode No The paper includes figures illustrating the framework (e.g., Figure 1 and Figure 2) but does not contain any explicit pseudocode blocks or algorithms.
Open Source Code No The paper does not contain any explicit statements about releasing source code, nor does it provide links to any code repositories.
Open Datasets Yes The evaluations are conducted on three TIPR datasets. CUHK-PEDES [Li et al., 2017] has 40,206 images and 80,412 descriptions of 13,003 people. ICFG-PEDES [Ding et al., 2021] consists of 54,522 image-text pairs of 4,102 persons... RSTPReid [Zhu et al., 2021] includes 20,505 images of 4,101 people...
Dataset Splits Yes CUHK-PEDES [Li et al., 2017]... The dataset is split into 34,054 images for training, 3,078 for validation, and 3,074 for testing. ICFG-PEDES [Ding et al., 2021]... Training uses 34,674 pairs from 3,102 people, with the remaining 1,000 people reserved for evaluation. RSTPReid [Zhu et al., 2021]... Training includes 3,701 people, while validation and testing include 200 people each.
Hardware Specification Yes Experiments are implemented using the PyTorch library on a single NVIDIA RTX 3090 (24GB) GPU.
Software Dependencies No The paper mentions the "PyTorch library" and "CLIP-ViT-B/16 as the backbone" but does not specify version numbers for these or any other software components.
Experiment Setup Yes Images are resized to 384×128 and augmented with random horizontal flipping, cropping with padding, and random erasing. The maximum length of the text sequence is set to 77, and random masking is employed for text augmentation. We use CLIP-ViT-B/16 as the backbone. Temperature factors are set to τa = 0.02, τsp = 10, τwp = 5, and τn = 40. Loss weight λ1 is 0.1, and the boundaries α and β in IBM loss are 0.6 and 0.4. Each mini-batch comprises B = P × K images, with P = 32 identities and K = 4 images per identity. In the first stage, only a fully connected layer is optimized for 60 epochs using a cosine learning rate schedule, starting at 1×10^-4. In the second stage, we fine-tune the visual/textual backbones with an initial learning rate of 1×10^-5 and the Adapter with 5×10^-5, also using a cosine schedule and trained for 60 epochs. Both stages adopt the Adam optimizer with a linear warm-up over the first 5 epochs.
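The learning-rate schedule quoted above (linear warm-up over the first 5 epochs, then cosine decay over 60 epochs, with per-module base rates in the second stage) can be sketched as a small helper. This is an illustrative reconstruction, not code from the paper; the function name and the exact decay formula are assumptions, since the paper only names the schedule.

```python
import math

def lr_at_epoch(epoch, base_lr, total_epochs=60, warmup_epochs=5):
    """Illustrative learning rate for a given epoch under linear warm-up
    followed by cosine decay, as described in the experiment setup.
    (Hypothetical helper; the paper does not give an explicit formula.)"""
    if epoch < warmup_epochs:
        # Linear warm-up from ~0 up to base_lr over the first warmup_epochs.
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine decay from base_lr toward 0 over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))

# Second-stage per-module base rates quoted in the setup:
backbone_lr = 1e-5   # visual/textual backbones
adapter_lr = 5e-5    # Adapter
```

In a PyTorch training loop, the two base rates would typically be realized as separate optimizer parameter groups, with this schedule applied multiplicatively to each group.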