Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

CompoDiff: Versatile Composed Image Retrieval With Latent Diffusion

Authors: Geonmo Gu, Sanghyuk Chun, Wonjae Kim, HeeJae Jun, Yoohoon Kang, Sangdoo Yun

TMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental CompoDiff not only achieves a new state-of-the-art on four ZS-CIR benchmarks, including FashionIQ, CIRR, CIRCO, and GeneCIS, but also enables a more versatile and controllable CIR by accepting various conditions, such as negative text and image mask conditions.
Researcher Affiliation Industry Geonmo Gu1, Sanghyuk Chun2, Wonjae Kim2, HeeJae Jun1, Yoohoon Kang1, Sangdoo Yun2 (1NAVER Vision, 2NAVER AI Lab; equal contribution)
Pseudocode No The paper describes methods through textual explanations and mathematical equations, accompanied by architectural diagrams (Fig. 3 and 4), but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code Yes The code and dataset are available at https://github.com/navervision/CompoDiff.
Open Datasets Yes This paper also introduces a new synthetic dataset, named SynthTriplets18M, with 18.8 million reference images, conditions, and corresponding target image triplets to train CIR models. The code and dataset are available at https://github.com/navervision/CompoDiff. ... we evaluate the models on the zero-shot (ZS) CIR scenario using four CIR benchmarks: FashionIQ (Wu et al., 2021), CIRR (Liu et al., 2021), CIRCO (Baldrati et al., 2023), and GeneCIS (Vaze et al., 2023); i.e., we report the retrieval results by the models trained on our SynthTriplets18M and a large-scale image-text paired dataset without access to the target triplet datasets.
Dataset Splits Yes FashionIQ (Wu et al., 2021) has (46.6k / 15.5k / 15.5k) (training / validation / test) images with three fashion categories: Shirt, Dress, and Toptee. Each category has 18k training triplets and 12k evaluation triplets of (x_i^R, x_c, x_i^T). ... CIRR has 36k open-domain triplets divided into the train, validation, and test sets in an 8:1:1 split. ... CIRCO. This dataset comprises 1020 queries, where 220 and 800 of them are used for validation and test, respectively.
Hardware Specification Yes The inference time was measured on a single A100 GPU with a batch size of 1.
Software Dependencies No The paper mentions using 'AdamW' as an optimizer and 'DDIM' for sampling variance, but does not provide specific version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used.
Experiment Setup Yes For the efficient training, all visual features are pre-extracted and frozen. All training text embeddings are extracted at every iteration. To improve computational efficiency, we reduced the number of input tokens of the T5 models to 77, as in CLIP. A single-layer perceptron was employed to align the dimension of text embeddings extracted from T5-XL with that of CLIP ViT-L/14. ... We report the detailed hyperparameters in Table A.1. (Table A.1 shows details like Diffusion steps: 1000, Sampling steps: 10, Dropout: 0.1, Weight decay: 6.0e-2, Batch size: 4096/2048, Iterations: 1M/200K/50K, Learning rate: 1e-4/1e-5, Optimizer: AdamW, EMA decay: 0.9999, Denoiser depth/heads/channels).
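The hyperparameters quoted above from Table A.1 can be collected into a minimal sketch. The `TrainConfig` class and `ema_update` helper below are illustrative, not from the paper's codebase; the only values taken from the report are the Table A.1 settings (AdamW, lr 1e-4/1e-5, weight decay 6.0e-2, dropout 0.1, batch size 4096/2048, 1000 diffusion steps, 10 sampling steps, EMA decay 0.9999).

```python
from dataclasses import dataclass


@dataclass
class TrainConfig:
    """Hyperparameters reported in Table A.1 of the paper (names are illustrative)."""
    optimizer: str = "AdamW"
    learning_rate: float = 1e-4   # 1e-5 is also listed, likely for a later stage
    weight_decay: float = 6.0e-2
    dropout: float = 0.1
    batch_size: int = 4096        # 2048 is also listed for some stages
    diffusion_steps: int = 1000
    sampling_steps: int = 10      # DDIM sampling steps at inference
    ema_decay: float = 0.9999


def ema_update(ema_params, params, decay):
    """Standard exponential moving average of model parameters,
    consistent with an EMA decay of 0.9999:
        ema <- decay * ema + (1 - decay) * current
    """
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]


cfg = TrainConfig()
# Toy update with scalar "parameters" to show the EMA step.
ema = ema_update([1.0, 0.0], [0.0, 1.0], cfg.ema_decay)
```

With a decay this close to 1, the EMA weights track the online weights very slowly (each step moves them only 0.01% toward the current parameters), which is why such averages are typically evaluated instead of the raw weights.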