MindSearch: Mimicking Human Minds Elicits Deep AI Searcher

Authors: Zehui Chen, Kuikun Liu, Qiuchen Wang, Jiangning Liu, Wenwei Zhang, Kai Chen, Feng Zhao

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To validate the effectiveness of MindSearch, we conducted extensive evaluations on both closed-set and open-set question-answering (QA) problems using GPT-4o and InternLM2.5-7B-Chat models. The experimental results demonstrate a substantial improvement in response quality, in both depth and breadth. Moreover, comparative analysis shows that the responses of MindSearch are preferred by human evaluators over those from existing applications such as ChatGPT-Web (based on GPT-4o) and Perplexity Pro. In Table 1, we compare our approach with two straightforward baselines: a raw LLM without search engines (w/o Search Engine), and simply treating search engines as an external tool with a ReAct-style interaction (ReAct Search). In this section, we conduct detailed ablation studies aiming to gain a deeper understanding of our approach.
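The "ReAct Search" baseline mentioned above interleaves the model's reasoning with search-tool calls in a single trajectory. The sketch below illustrates that interaction style only; the function names (`llm_step`, `web_search`) and the transcript format are assumptions, not the authors' code.

```python
# Minimal sketch of a ReAct-style search loop: the LLM alternates
# Thought -> Action -> Observation until it emits a Finish action.
# `llm_step` and `web_search` are hypothetical stubs.

def react_search(question, llm_step, web_search, max_steps=5):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # The model reads the transcript and proposes its next step.
        thought, action, arg = llm_step(transcript)
        transcript += f"Thought: {thought}\nAction: {action}[{arg}]\n"
        if action == "Finish":
            return arg  # final answer
        # Otherwise the action is a search; feed the result back as an observation.
        observation = web_search(arg)
        transcript += f"Observation: {observation}\n"
    return None  # step budget exhausted without an answer
```

The single linear transcript is precisely what distinguishes this baseline from MindSearch's graph decomposition, where independent sub-questions can be dispatched in parallel.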
Researcher Affiliation | Collaboration | 1MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, USTC; 2Shanghai AI Laboratory
Pseudocode | No | The paper describes the methodology through prose and architectural diagrams (Figures 1, 2, 3), and mentions that the Web Planner interacts with the graph 'via Python code generation' and 'predefined atomic code functions'. However, it does not include any explicit pseudocode blocks or labeled algorithms, nor does it present the structured steps of its core algorithms in pseudocode form.
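To make the quoted description concrete: the planner emits Python code that builds a DAG of search sub-questions through a small set of atomic functions. The class and method names below are assumptions illustrating the idea, not the authors' actual API.

```python
# Hypothetical sketch of "predefined atomic code functions" for the
# planner's question graph: nodes are sub-questions, edges are dependencies.

class SearchGraph:
    def __init__(self):
        self.nodes = {}   # name -> {"question": ..., "answer": ...}
        self.edges = {}   # name -> list of successor node names

    def add_node(self, name, question):
        """Register a sub-question node to be answered by a web searcher."""
        self.nodes[name] = {"question": question, "answer": None}
        self.edges.setdefault(name, [])

    def add_edge(self, src, dst):
        """Declare that `dst` depends on the answer to `src`."""
        self.edges.setdefault(src, []).append(dst)

    def ready_nodes(self):
        """Unanswered nodes whose prerequisites are all answered.

        These can be dispatched to searchers in parallel."""
        answered = {n for n, v in self.nodes.items() if v["answer"] is not None}
        blocked = {d for s, ds in self.edges.items()
                   if s not in answered for d in ds}
        return [n for n in self.nodes
                if n not in blocked and self.nodes[n]["answer"] is None]

# Code the planner might generate for a 2-hop question:
g = SearchGraph()
g.add_node("birthplace", "Where was the director of Inception born?")
g.add_node("population", "What is the population of that city?")
g.add_edge("birthplace", "population")
```

Exposing graph edits as plain function calls lets the planner express decomposition as executable code rather than free-form text, which is what the paper means by interacting with the graph "via Python code generation".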
Open Source Code | Yes | Code is available at https://github.com/InternLM/MindSearch.
Open Datasets | Yes | We extensively evaluate our approach on a wide range of closed-set QA tasks, including Bamboogle (Press et al., 2022), MuSiQue (Trivedi et al., 2022), and HotpotQA (Yang et al., 2018).
Dataset Splits | No | The paper mentions using well-known datasets such as Bamboogle, MuSiQue, and HotpotQA, and discusses performance across difficulty levels (e.g., 2-hop, 3-hop, and 4-hop for MuSiQue; Easy, Medium, and Hard for HotpotQA). However, it does not state the training, validation, or test split percentages or sample counts used in its experiments, nor does it explicitly reference the use of standard predefined splits.
Hardware Specification | No | The paper mentions a 'GPU cluster built by MCC Lab of Information Science and Technology Institution, USTC' in the acknowledgements, but does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for the experiments.
Software Dependencies | No | The paper mentions using specific LLMs such as 'GPT-4o' and 'InternLM2.5-7B-Chat' and refers to 'Python programming in a Jupyter environment' and a 'Python code interpreter'. However, it does not provide version numbers for Python, Jupyter, or any other critical software libraries (e.g., PyTorch, TensorFlow) that would be needed to replicate the experimental environment.
Experiment Setup | Yes | To better gauge the utility and search performance, we carefully curate 100 real-world human queries and collect responses from MindSearch (InternLM2.5-7B-Chat (Cai et al., 2024)), Perplexity.ai (its Pro version), and ChatGPT with search plugin (Achiam et al., 2023)... we select both a closed-source LLM (GPT-4o) and an open-source LLM (InternLM2.5-7B-Chat) as our LLM backend. Since our approach adopts a zero-shot experimental setting, we utilize a subjective LLM evaluator (GPT-4o) to gauge the correctness of HotpotQA... during experiments, we limit the max interaction turn to 10... System Prompt for Web Planner and System Prompt for Web Searcher (Appendix G).
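The "subjective LLM evaluator" mentioned above is an LLM-as-judge protocol: a strong model decides whether a free-form prediction matches the gold answer. The sketch below shows the general shape of such a check; the prompt wording and the `call_llm` stub are assumptions, not the paper's actual evaluation prompt.

```python
# Hedged sketch of LLM-as-judge correctness scoring. The judge model
# (e.g. GPT-4o in the paper) receives question, gold answer, and prediction,
# and replies yes/no. Prompt text here is illustrative only.

JUDGE_PROMPT = (
    "Question: {q}\n"
    "Gold answer: {gold}\n"
    "Prediction: {pred}\n"
    "Does the prediction answer the question correctly? Reply yes or no."
)

def judge_correct(q, gold, pred, call_llm):
    """Return True iff the judge model deems the prediction correct."""
    verdict = call_llm(JUDGE_PROMPT.format(q=q, gold=gold, pred=pred))
    return verdict.strip().lower().startswith("yes")
```

Such a judge is used here because zero-shot answers rarely match gold strings exactly, so exact-match metrics would understate correctness.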