ENCODER: Entity Mining and Modification Relation Binding for Composed Image Retrieval

Authors: Zixu Li, Zhiwei Chen, Haokun Wen, Zhiheng Fu, Yupeng Hu, Weili Guan

AAAI 2025

Reproducibility Variable | Result | LLM Response

Research Type: Experimental
"Extensive experiments on four benchmark datasets demonstrate the superiority of our proposed method."

Researcher Affiliation: Academia
1 School of Software, Shandong University; 2 School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen); 3 School of Data Science, City University of Hong Kong. EMAIL, EMAIL, EMAIL

Pseudocode: No
The paper describes its methodology through textual explanations and mathematical formulations, but it contains no explicitly labeled pseudocode or algorithm blocks.

Open Source Code: Yes
"Moreover, we have released our codes to facilitate other researchers." Code: https://sdu-l.github.io/ENCODER.github.io/

Open Datasets: Yes
"Following previous works, we chose four benchmark datasets for evaluation, including three fashion-domain datasets, Fashion IQ (Wu et al. 2021), Shoes (Guo et al. 2018), and Fashion200K (Han et al. 2017), and an open-domain dataset, CIRR (Liu et al. 2021b)."

Dataset Splits: No
The paper reports evaluation metrics for the datasets (e.g., R@k for Shoes and Fashion200K; R@10 and R@50 for Fashion IQ), but it never states the training, validation, or test splits. For example, it gives a batch size but no split ratios or counts.

Hardware Specification: Yes
"All experiments were conducted on a single NVIDIA Tesla T4 GPU with 16GB memory and trained 10 epochs."

Software Dependencies: No
The paper names the pretrained CLIP backbone (Radford et al. 2021) (ViT-B/32 version) and the AdamW optimizer, but it does not specify versions for ancillary software such as Python, PyTorch, or CUDA.

Experiment Setup: Yes
"ENCODER is built upon the pretrained CLIP (Radford et al. 2021) (ViT-B/32 version). We trained ENCODER using the AdamW optimizer with the initial learning rate of 5e-5, while the batch size is set to 128 and the learning rate for CLIP is 1e-6. Empirically, we maintained a consistent embedding dimension D of 512 throughout the network. We set the latent factor number P to 4 and the query number E of LRQ to 3. We also adopt the temperature factor τ to 0.1 for Eqn. (9, 13, 14). Through a comprehensive grid search, we set κ = 0.8, γ = 0.5, and µ = 0.5 for all four datasets. All experiments were conducted on a single NVIDIA Tesla T4 GPU with 16GB memory and trained 10 epochs."
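The reported hyperparameters can be gathered into a single reproducibility checklist. The sketch below is a minimal illustration, not code from the released repository: the `EncoderConfig` class and `param_groups` helper are our own names, assuming a PyTorch/AdamW-style setup in which the CLIP backbone is fine-tuned at a smaller learning rate than the newly added modules, as the paper describes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    """Hyperparameters as reported in the paper's experiment setup."""
    lr: float = 5e-5            # initial learning rate for new modules (AdamW)
    clip_lr: float = 1e-6       # separate, smaller learning rate for CLIP
    batch_size: int = 128
    embed_dim: int = 512        # embedding dimension D
    latent_factors: int = 4     # latent factor number P
    lrq_queries: int = 3        # query number E of LRQ
    temperature: float = 0.1    # tau in Eqn. (9), (13), (14)
    kappa: float = 0.8          # grid-searched, shared across all four datasets
    gamma: float = 0.5
    mu: float = 0.5
    epochs: int = 10


def param_groups(cfg, clip_params, other_params):
    """Build AdamW-style parameter groups: the pretrained CLIP backbone
    gets cfg.clip_lr, while the remaining parameters get cfg.lr."""
    return [
        {"params": clip_params, "lr": cfg.clip_lr},
        {"params": other_params, "lr": cfg.lr},
    ]


cfg = EncoderConfig()
groups = param_groups(cfg, clip_params=["clip.weight"], other_params=["head.weight"])
```

In a real training script, `groups` would be passed directly to `torch.optim.AdamW`; keeping the two learning rates in one frozen config object makes the grid-searched values (κ, γ, µ) easy to audit against the paper.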