MMSearch: Unveiling the Potential of Large Models as Multi-modal Search Engines

Authors: Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanmin Wu, Jiayi Lei, Pengshuo Qiu, Pan Lu, Zehui Chen, Guanglu Song, Peng Gao, Yu Liu, Chunyuan Li, Hongsheng Li

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We conduct extensive experiments on closed-source and open-source LMMs. Among all tested models, GPT-4o with MMSEARCH-ENGINE achieves the best results, surpassing the commercial product Perplexity Pro in the end-to-end task and demonstrating the effectiveness of our proposed pipeline. We further present an error analysis revealing that current LMMs still struggle to fully grasp multimodal search tasks, and conduct an ablation study indicating the potential of scaling test-time computation for AI search engines. We hope MMSEARCH may provide unique insights to guide the future development of multimodal AI search engines.
Researcher Affiliation Collaboration Dongzhi Jiang1, Renrui Zhang1,3, Ziyu Guo2, Yanmin Wu5, Jiayi Lei4, Pengshuo Qiu4, Pan Lu6, Zehui Chen3, Guanglu Song7, Peng Gao4, Yu Liu7, Chunyuan Li3, Hongsheng Li1,8. Affiliations: 1CUHK MMLab, 2MiuLar Lab, 3ByteDance, 4Shanghai AI Laboratory, 5Peking University, 6Stanford University, 7SenseTime Research, 8CPII under InnoHK
Pseudocode No The paper describes a pipeline with three sequential phases (Requery, Rerank, Summarization) in Section 2.1, but these are described in natural language and illustrated with a diagram (Figure 2), rather than presented as structured pseudocode or algorithm blocks.
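Since the paper describes the three sequential phases only in natural language, the pipeline can be sketched roughly as follows. This is a hypothetical illustration: the `lmm` and `search` callables, function name, and prompt wording are assumptions for clarity, not the authors' actual implementation.

```python
# Hypothetical sketch of the three-stage MMSEARCH-ENGINE pipeline
# (Requery -> Rerank -> Summarization) as described in Section 2.1.
# The `lmm` and `search` interfaces are illustrative assumptions.

def mmsearch_pipeline(lmm, search, query, image=None, k=8):
    """Run the three sequential phases on one (possibly multimodal) query."""
    # Phase 1: Requery — the LMM rewrites the user query (optionally with
    # an image) into a text query suitable for a search engine.
    requery = lmm(f"Rewrite this as a search-engine query: {query}",
                  image=image)

    # Phase 2: Rerank — retrieve the top-K websites and let the LMM pick
    # the single most relevant one from their titles/snippets.
    websites = search(requery, top_k=k)
    best = lmm("Pick the most relevant result:\n"
               + "\n".join(w["snippet"] for w in websites))

    # Phase 3: Summarization — the LMM answers the original query using
    # the content of the selected website.
    return lmm(f"Answer the query '{query}' using this content:\n{best}")
```

With stub `lmm` and `search` functions substituted in, the same call structure can drive any LMM backend; the paper evaluates each phase (requery, rerank, summarization) separately as well as end-to-end.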
Open Source Code Yes We provide the demo code of MMSEARCH-ENGINE, which can be also used for inference, in the code directory in the supplementary material.
Open Datasets Yes MMSEARCH, a comprehensive benchmark for multimodal AI search engines, which, to our best knowledge, serves as the first evaluation dataset to measure LMMs' multimodal searching capabilities. Our benchmark categorizes searching queries into two primary areas: News and Knowledge, as shown in Fig. 1. We also include the end-to-end data in the data directory in the supplementary material, along with the loading script.
Dataset Splits No The paper introduces MMSEARCH as a comprehensive evaluation benchmark with 300 queries, categorized into News and Knowledge areas and further by difficulty level (hard, medium, easy) for analysis. However, it does not specify explicit training/validation/test splits in the traditional sense, as its primary purpose is to evaluate existing LMMs on this new benchmark rather than to train models.
Hardware Specification Yes We conduct all experiments on NVIDIA A100 GPUs.
Software Dependencies No The paper mentions using Google Lens and a text embedding model (Chen et al., 2024a) but does not provide specific version numbers for any software dependencies used in their own implementation, such as programming languages or libraries.
Experiment Setup Yes We set the number of retrieved websites K to 8. We include two image input resolution settings: in the default setting, the longest edge of the input image is resized to match the largest resolution of the LMM's vision encoder; in the any-resolution setting, we input the image without resizing. The requery prompt contains 3 examples to better guide LMMs to output a valid requery, while prompts for all other tasks are zero-shot.
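The default resizing rule described above (longest edge scaled to the vision encoder's largest resolution, aspect ratio preserved) can be sketched as below. The target value 336 is only an illustrative default for a common vision-encoder resolution, not a figure taken from the paper.

```python
# Minimal sketch of the default image-resolution setting: scale the
# longest edge to the vision encoder's largest resolution while keeping
# the aspect ratio. `target=336` is an assumed illustrative value.

def resize_longest_edge(width, height, target=336):
    """Return new (width, height) with the longest edge scaled to `target`."""
    scale = target / max(width, height)
    return round(width * scale), round(height * scale)
```

For example, a 672x336 screenshot would become 336x168 under this rule, whereas the any-resolution setting would pass it through unchanged.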