MAI: A Multi-turn Aggregation-Iteration Model for Composed Image Retrieval
Authors: Yanzhe Chen, Zhiwen Yang, Jinglin Xu, Yuxin Peng
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that the proposed MAI model achieves substantial improvements over state-of-the-art methods. |
| Researcher Affiliation | Academia | ¹Wangxuan Institute of Computer Technology, Peking University; ²School of Intelligence Science and Technology, University of Science and Technology Beijing |
| Pseudocode | No | The paper describes the Multi-turn Iterative Optimization (MIO) mechanism using mathematical formulations (Eq. 1, 2, 3) and descriptive text, but it does not present it in a structured pseudocode or algorithm block. |
| Open Source Code | Yes | The dataset and source code are available at https://github.com/PKU-ICST-MIPL/MAI_ICLR2025. |
| Open Datasets | Yes | The dataset and source code are available at https://github.com/PKU-ICST-MIPL/MAI_ICLR2025. |
| Dataset Splits | No | The paper does not explicitly provide specific training/test/validation dataset splits (e.g., percentages or exact counts) for the experiments. It mentions using "20% of the data from each dataset for manual scoring" for quality assessment, but not for model training or evaluation. |
| Hardware Specification | Yes | All model training and inference are conducted on 8 V100 GPUs. |
| Software Dependencies | Yes | We adopt BLIP-2 (Li et al., 2023) with the Flan-t5-xxl language model (Chung et al., 2024) for image captioning and Xwin-13B-V0.2 (Ni et al., 2024) as the LLM. Optimization is performed using AdamW (Loshchilov & Hutter, 2019). |
| Experiment Setup | Yes | Optimization is performed using AdamW (Loshchilov & Hutter, 2019) with a batch size of 16, an initial learning rate of 1e-5, and cosine annealing. Training runs for 50 epochs, while inference uses a batch size of 2048. All model training and inference are conducted on 8 V100 GPUs. The number of learned tokens is fixed at 32, and 32 tokens are retained at each turn of the MIO mechanism. |
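
The Experiment Setup row reports an initial learning rate of 1e-5 with cosine annealing over 50 epochs. A minimal sketch of that schedule is below; the minimum learning rate of 0 and per-epoch stepping are assumptions, as the paper only states the initial rate and the annealing type.

```python
import math

def cosine_annealed_lr(epoch, total_epochs=50, lr_init=1e-5, lr_min=0.0):
    """Cosine-annealed learning rate matching the reported setup.

    Assumptions (not stated in the table): the rate decays to lr_min=0
    and is updated once per epoch rather than per step.
    """
    progress = epoch / total_epochs
    return lr_min + 0.5 * (lr_init - lr_min) * (1 + math.cos(math.pi * progress))

# Schedule over the 50 reported training epochs.
schedule = [cosine_annealed_lr(e) for e in range(50)]
```

At epoch 0 this yields the reported 1e-5, halves to 5e-6 at the schedule midpoint, and approaches 0 by the final epoch.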