reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

QuRe: Query-Relevant Retrieval through Hard Negative Sampling in Composed Image Retrieval

Authors: Jaehyun Kwak, Ramahdani Muhammad Izaaz Inhar, Se-Young Yun, Sung-Ju Lee

ICML 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments demonstrate that QURE achieves state-of-the-art performance on Fashion IQ and CIRR datasets while exhibiting the strongest alignment with human preferences on the HP-Fashion IQ dataset.
Researcher Affiliation	Academia	KAIST. Correspondence to: Sung-Ju Lee <EMAIL>.
Pseudocode	Yes	Algorithm 1 Training Flow of QURE
Open Source Code	Yes	The source code is available at https://github.com/jackwaky/QuRe.
Open Datasets	Yes	We evaluate the models on widely used CIR datasets, Fashion IQ (Wu et al., 2021) and CIRR (Suhr et al., 2018), to assess their ability to retrieve the target image.
Dataset Splits	Yes	We evaluate the models on widely used CIR datasets, Fashion IQ (Wu et al., 2021) and CIRR (Suhr et al., 2018), to assess their ability to retrieve the target image. Additionally, we evaluate them on the HP-Fashion IQ dataset to assess their alignment with human preferences. ... We selected the Fashion IQ dataset for its high relevance and broad applicability, mirroring the search functionalities of e-commerce platforms.
Hardware Specification	Yes	All experiments were conducted using a single Nvidia RTX 3090 GPU.
Software Dependencies	No	No specific software dependencies with version numbers are mentioned in the paper, beyond the use of BLIP-2 as a backbone model and AdamW optimizer.
Experiment Setup	Yes	QURE is trained using the AdamW optimizer (Loshchilov, 2017) for 50 epochs on CIRR and 30 epochs on Fashion IQ. The hard negative set H was defined n_def times, starting with a warm-up phase where H initially included the entire corpus except for the target during the first n_epoch/n_def epochs. The hard negative set H is updated every n_epoch/n_def epochs. We set n_def to six for both Fashion IQ and CIRR. ... We resized images to 224 × 224 with a 1.25 padding ratio.
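
The update schedule quoted in the Experiment Setup row can be made concrete with a small sketch. It is not taken from the QuRe repository: the function names `hard_negative_schedule` and `uses_full_corpus` are hypothetical, and the use of floor division when n_epoch/n_def is not an integer (as with 50 epochs and n_def = 6 on CIRR) is an assumption, since the quoted text does not say how that case is rounded.

```python
def hard_negative_schedule(n_epoch: int, n_def: int) -> list[int]:
    """Return the (0-indexed) epochs at which the hard negative set H is defined.

    The first entry is the warm-up definition (H = corpus minus the target);
    the remaining entries are the later updates of H.
    """
    if n_def < 1 or n_epoch < n_def:
        raise ValueError("need 1 <= n_def <= n_epoch")
    # Assumption: floor division when n_epoch / n_def is not an integer.
    period = n_epoch // n_def
    return [k * period for k in range(n_def)]


def uses_full_corpus(epoch: int, n_epoch: int, n_def: int) -> bool:
    """True during the warm-up phase, i.e. the first n_epoch / n_def epochs."""
    return epoch < n_epoch // n_def


print(hard_negative_schedule(30, 6))  # Fashion IQ: [0, 5, 10, 15, 20, 25]
print(hard_negative_schedule(50, 6))  # CIRR: [0, 8, 16, 24, 32, 40]
```

Under these assumptions, H is defined six times on each dataset: every 5 epochs on Fashion IQ, and every 8 epochs on CIRR with the last two epochs reusing the final definition.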