MindSearch: Mimicking Human Minds Elicits Deep AI Searcher

Authors: Zehui Chen, Kuikun Liu, Qiuchen Wang, Jiangning Liu, Wenwei Zhang, Kai Chen, Feng Zhao

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To validate the effectiveness of MindSearch, we conducted extensive evaluations on both closed-set and open-set question-answering (QA) problems using GPT-4o and InternLM2.5-7B-Chat models. The experimental results demonstrate a substantial improvement in response quality, in both depth and breadth. Moreover, comparative analysis shows that the responses of MindSearch are preferred by human evaluators over those from existing applications such as ChatGPT-Web (based on GPT-4o) and Perplexity Pro. In Table 1, we compare our approach with two straightforward baselines: a raw LLM without search engines (w/o Search Engine), and simply treating search engines as an external tool with a ReAct-style interaction (ReAct Search). In this section, we conduct detailed ablation studies aiming to gain a deeper understanding of our approach.
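The "ReAct Search" baseline mentioned above interleaves the model's reasoning with search-tool calls in a single trajectory. The sketch below illustrates that interaction style only; the function names (`llm_step`, `web_search`) and the transcript format are assumptions, not the authors' code.

```python
# Minimal sketch of a ReAct-style search loop: the LLM alternates
# Thought -> Action -> Observation until it emits a Finish action.
# `llm_step` and `web_search` are hypothetical stubs.

def react_search(question, llm_step, web_search, max_steps=5):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # The model reads the transcript and proposes its next step.
        thought, action, arg = llm_step(transcript)
        transcript += f"Thought: {thought}\nAction: {action}[{arg}]\n"
        if action == "Finish":
            return arg  # final answer
        # Otherwise the action is a search; feed the result back as an observation.
        observation = web_search(arg)
        transcript += f"Observation: {observation}\n"
    return None  # step budget exhausted without an answer
```

The single linear transcript is precisely what distinguishes this baseline from MindSearch's graph decomposition, where independent sub-questions can be dispatched in parallel.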
Researcher Affiliation | Collaboration | 1MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, USTC; 2Shanghai AI Laboratory
Pseudocode | No | The paper describes the methodology through prose and architectural diagrams (Figures 1, 2, 3), and mentions that the Web Planner interacts with the graph 'via Python code generation' and 'predefined atomic code functions'. However, it does not include any explicit pseudocode blocks or labeled algorithms, nor does it present the structured steps of its core algorithms in pseudocode form.
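To make the quoted description concrete: the planner emits Python code that builds a DAG of search sub-questions through a small set of atomic functions. The class and method names below are assumptions illustrating the idea, not the authors' actual API.

```python
# Hypothetical sketch of "predefined atomic code functions" for the
# planner's question graph: nodes are sub-questions, edges are dependencies.

class SearchGraph:
    def __init__(self):
        self.nodes = {}   # name -> {"question": ..., "answer": ...}
        self.edges = {}   # name -> list of successor node names

    def add_node(self, name, question):
        """Register a sub-question node to be answered by a web searcher."""
        self.nodes[name] = {"question": question, "answer": None}
        self.edges.setdefault(name, [])

    def add_edge(self, src, dst):
        """Declare that `dst` depends on the answer to `src`."""
        self.edges.setdefault(src, []).append(dst)

    def ready_nodes(self):
        """Unanswered nodes whose prerequisites are all answered.

        These can be dispatched to searchers in parallel."""
        answered = {n for n, v in self.nodes.items() if v["answer"] is not None}
        blocked = {d for s, ds in self.edges.items()
                   if s not in answered for d in ds}
        return [n for n in self.nodes
                if n not in blocked and self.nodes[n]["answer"] is None]

# Code the planner might generate for a 2-hop question:
g = SearchGraph()
g.add_node("birthplace", "Where was the director of Inception born?")
g.add_node("population", "What is the population of that city?")
g.add_edge("birthplace", "population")
```

Exposing graph edits as plain function calls lets the planner express decomposition as executable code rather than free-form text, which is what the paper means by interacting with the graph "via Python code generation".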
Open Source Code | Yes | Code is available at https://github.com/InternLM/MindSearch.
Open Datasets | Yes | We extensively evaluate our approach on a wide range of closed-set QA tasks, including Bamboogle (Press et al., 2022), MuSiQue (Trivedi et al., 2022), and HotpotQA (Yang et al., 2018).
Dataset Splits | No | The paper mentions using well-known datasets such as Bamboogle, MuSiQue, and HotpotQA, and discusses performance across difficulty levels (e.g., 2-hop, 3-hop, and 4-hop for MuSiQue; Easy, Medium, and Hard for HotpotQA). However, it does not state the training, validation, or test split percentages or sample counts used in its experiments, nor does it explicitly reference the use of standard predefined splits.
Hardware Specification | No | The paper mentions a 'GPU cluster built by MCC Lab of Information Science and Technology Institution, USTC' in the acknowledgements, but does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for the experiments.
Software Dependencies | No | The paper mentions using specific LLMs such as 'GPT-4o' and 'InternLM2.5-7B-Chat' and refers to 'Python programming in a Jupyter environment' and a 'Python code interpreter'. However, it does not provide version numbers for Python, Jupyter, or any other critical software libraries (e.g., PyTorch, TensorFlow) that would be needed to replicate the experimental environment.
Experiment Setup | Yes | To better gauge the utility and search performance, we carefully curate 100 real-world human queries and collect responses from MindSearch (InternLM2.5-7B-Chat (Cai et al., 2024)), Perplexity.ai (its Pro version), and ChatGPT with search plugin (Achiam et al., 2023)... we select both a closed-source LLM (GPT-4o) and an open-source LLM (InternLM2.5-7B-Chat) as our LLM backend. Since our approach adopts a zero-shot experimental setting, we utilize a subjective LLM evaluator (GPT-4o) to gauge the correctness of HotpotQA... during experiments, we limit the max interaction turn to 10... System Prompt for Web Planner and System Prompt for Web Searcher (Appendix G).
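The "subjective LLM evaluator" mentioned above is an LLM-as-judge protocol: a strong model decides whether a free-form prediction matches the gold answer. The sketch below shows the general shape of such a check; the prompt wording and the `call_llm` stub are assumptions, not the paper's actual evaluation prompt.

```python
# Hedged sketch of LLM-as-judge correctness scoring. The judge model
# (e.g. GPT-4o in the paper) receives question, gold answer, and prediction,
# and replies yes/no. Prompt text here is illustrative only.

JUDGE_PROMPT = (
    "Question: {q}\n"
    "Gold answer: {gold}\n"
    "Prediction: {pred}\n"
    "Does the prediction answer the question correctly? Reply yes or no."
)

def judge_correct(q, gold, pred, call_llm):
    """Return True iff the judge model deems the prediction correct."""
    verdict = call_llm(JUDGE_PROMPT.format(q=q, gold=gold, pred=pred))
    return verdict.strip().lower().startswith("yes")
```

Such a judge is used here because zero-shot answers rarely match gold strings exactly, so exact-match metrics would understate correctness.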