The AI Hippocampus: How Far are We From Human Memory?

Authors: Zixia Jia, Jiaqi Li, Yipeng Kang, Yuxuan Wang, Tong Wu, Quansen Wang, Xiaobo Wang, Shuyi Zhang, Junzhe Shen, Qing Li, Siyuan Qi, Yitao Liang, Di He, Zilong Zheng, Song-Chun Zhu

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We benchmarked a variety of memory frameworks: a no-memory baseline, a simple RAG implementation using Chroma DB, LangChain's native FAISS RAG, Haystack, LlamaIndex, Mem0 (local version), and Zep (API version). The results, presented in Table 4 and Table 5, show the correctness rate and the average total time per question (in seconds), which includes data ingestion, retrieval, and reasoning. For all frameworks except the no-memory baseline, ingestion and retrieval accounted for the majority of the processing time.
Researcher Affiliation Academia Zixia Jia 1, Jiaqi Li 1, Yipeng Kang 1, Yuxuan Wang 1, Tong Wu 1, Quansen Wang 1,2, Xiaobo Wang 1, Shuyi Zhang 1, Junzhe Shen 1, Qing Li 1, Siyuan Qi 1, Yitao Liang 2, Di He 2, Zilong Zheng 1, Song-Chun Zhu. 1 State Key Laboratory of General Artificial Intelligence, BIGAI; 2 Peking University.
Pseudocode No The paper describes various methodologies and architectures but does not include any explicit pseudocode blocks or algorithms.
Open Source Code Yes The survey's website is available at https://github.com/bigai-nlco/LLM-Memory-Survey.
Open Datasets Yes For our evaluation, we selected the longmemeval_s_cleaned dataset from the official LongMemEval benchmark (Wu et al., 2024a).
Dataset Splits Yes The full dataset comprises 2,500 questions. ... Due to the extensive average processing times of Mem0, Langchain, and Zep, we conducted their evaluations on a 10% random sample of the dataset.
Hardware Specification No The paper mentions using "Llama3-8B-IT" and "GPT-4o-mini" as reasoning engines and for evaluation, but it does not specify the underlying hardware (e.g., GPU/CPU models, memory, or cloud resources) on which these models or the benchmarked frameworks were run.
Software Dependencies No The paper mentions several frameworks and models, such as "Chroma DB", "LangChain's native FAISS RAG", "Haystack", "LlamaIndex", "Mem0", "Zep", "Llama3-8B-IT", and "GPT-4o-mini", but it does not provide specific version numbers for any of these software components, libraries, or programming languages.
Experiment Setup No The paper states, "For the reasoning engine, we employed two LLMs: Llama-3-8B-IT and GPT-4o-mini. All responses were evaluated for correctness using GPT-4o-mini." While it names the models used for evaluation, it does not provide specific hyperparameters (e.g., learning rate, batch size, temperature) or other detailed configuration settings for these models or the benchmarked frameworks that would be necessary to fully reproduce their evaluation setup.
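The per-question timing breakdown the paper reports (ingestion + retrieval + reasoning summed into a total time) can be sketched with a toy in-memory keyword retriever. This is a minimal illustration of the measurement structure only: the corpus, the token-overlap retriever, and the `answer` stub standing in for the LLM reasoning engine are all assumptions, not the paper's actual benchmark harness or any of the frameworks it evaluates.

```python
import time


def ingest(corpus):
    # Toy "ingestion": build an inverted index from lowercase tokens to doc ids.
    index = {}
    for doc_id, text in enumerate(corpus):
        for token in text.lower().split():
            index.setdefault(token, set()).add(doc_id)
    return index


def retrieve(index, corpus, query, k=2):
    # Toy retrieval: rank documents by how many query tokens they share.
    scores = {}
    for token in query.lower().split():
        for doc_id in index.get(token, ()):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    return [corpus[i] for i in ranked]


def answer(query, context):
    # Stand-in for the LLM reasoning step (hypothetical; the paper used
    # Llama3-8B-IT and GPT-4o-mini here).
    return context[0] if context else "no answer"


def timed_question(corpus, query):
    # Time each phase separately, mirroring the paper's reported breakdown.
    timings = {}
    t0 = time.perf_counter()
    index = ingest(corpus)
    timings["ingestion"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    context = retrieve(index, corpus, query)
    timings["retrieval"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    result = answer(query, context)
    timings["reasoning"] = time.perf_counter() - t0

    # Total is the sum of the three phases (computed before "total" is added).
    timings["total"] = sum(timings.values())
    return result, timings
```

With a real framework, `ingest` and `retrieve` would dominate `timings["total"]`, which is the pattern the paper observes for every framework except the no-memory baseline.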