The AI Hippocampus: How Far are We From Human Memory?
Authors: Zixia Jia, Jiaqi Li, Yipeng Kang, Yuxuan Wang, Tong Wu, Quansen Wang, Xiaobo Wang, Shuyi Zhang, Junzhe Shen, Qing Li, Siyuan Qi, Yitao Liang, Di He, Zilong Zheng, Song-Chun Zhu
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We benchmarked a variety of memory frameworks, including a baseline with no memory, a simple RAG implementation using Chroma DB, Langchain's native FAISS RAG, Haystack, Llama Index, Mem0 (local version), and Zep (API version). The results, presented in Table 4 and Table 5, show the correctness rate and the average total time per question (in seconds), which includes data ingestion, retrieval, and reasoning. For all frameworks except the no-memory baseline, ingestion and retrieval constituted the majority of the processing time. |
| Researcher Affiliation | Academia | Zixia Jia¹, Jiaqi Li¹, Yipeng Kang¹, Yuxuan Wang¹, Tong Wu¹, Quansen Wang¹ ², Xiaobo Wang¹, Shuyi Zhang¹, Junzhe Shen¹, Qing Li¹, Siyuan Qi¹, Yitao Liang², Di He², Zilong Zheng¹, Song-Chun Zhu. ¹State Key Laboratory of General Artificial Intelligence, BIGAI; ²Peking University |
| Pseudocode | No | The paper describes various methodologies and architectures but does not include any explicit pseudocode blocks or algorithms. |
| Open Source Code | Yes | The survey's website is available at https://github.com/bigai-nlco/LLM-Memory-Survey. |
| Open Datasets | Yes | For our evaluation, we selected the longmemeval_s_cleaned dataset from the official LongMemEval benchmark (Wu et al., 2024a). |
| Dataset Splits | Yes | The full dataset comprises 2,500 questions. ... Due to the extensive average processing times of Mem0, Langchain, and Zep, we conducted their evaluations on a 10% random sample of the dataset. |
| Hardware Specification | No | The paper mentions using "Llama3-8B-IT" and "GPT-4o-mini" as reasoning engines and for evaluation, but it does not specify the underlying hardware (e.g., GPU/CPU models, memory, or cloud resources) on which these models or the benchmarked frameworks were run. |
| Software Dependencies | No | The paper mentions several frameworks and models such as "Chroma DB", "Langchain's native FAISS RAG", "Haystack", "Llama Index", "Mem0", "Zep", "Llama3-8B-IT", and "GPT-4o-mini", but it does not provide specific version numbers for any of these software components, libraries, or programming languages used. |
| Experiment Setup | No | The paper states, "For the reasoning engine, we employed two LLMs: Llama-3-8B-IT and GPT-4o-mini. All responses were evaluated for correctness using GPT-4o-mini." While it names the models used for evaluation, it does not provide specific hyperparameters (e.g., learning rate, batch size, temperature) or other detailed configuration settings for these models or the benchmarked frameworks that would be necessary to fully reproduce their evaluation setup. |
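The evaluation protocol described in the table (per-question timing covering ingestion, retrieval, and reasoning, with a 10% random subsample for the slowest frameworks) can be sketched as follows. This is a hypothetical harness, not the authors' code: the `KeywordMemory` class, the `evaluate` function, and the exact-match scoring are illustrative stand-ins for the real frameworks and the GPT-4o-mini judge.

```python
import random
import time

def evaluate(framework, questions, sample_frac=1.0, seed=0):
    """Time each question end to end (ingest + retrieve + reason) and score it.

    Mirrors the reported protocol: total time per question includes all three
    phases, and slow frameworks can be run on a random fraction of questions.
    """
    rng = random.Random(seed)
    if sample_frac < 1.0:
        k = max(1, int(len(questions) * sample_frac))
        questions = rng.sample(questions, k)  # e.g. 10% sample for Mem0/Langchain/Zep
    correct, total_time = 0, 0.0
    for q in questions:
        t0 = time.perf_counter()
        framework.ingest(q["history"])                      # data ingestion
        context = framework.retrieve(q["question"])         # retrieval
        answer = framework.answer(q["question"], context)   # reasoning
        total_time += time.perf_counter() - t0
        correct += int(answer == q["gold"])  # stand-in for the LLM correctness judge
    return correct / len(questions), total_time / len(questions)

class KeywordMemory:
    """Toy stand-in for a memory framework: stores turns, retrieves by word overlap."""
    def __init__(self):
        self.turns = []
    def ingest(self, history):
        self.turns = list(history)
    def retrieve(self, question, k=1):
        qwords = set(question.lower().split())
        ranked = sorted(self.turns,
                        key=lambda t: -len(qwords & set(t.lower().split())))
        return ranked[:k]
    def answer(self, question, context):
        # Naive "reasoning": read the value off the best-matching turn.
        return context[0].split(":")[-1].strip() if context else ""
```

A real harness would swap `KeywordMemory` for each benchmarked framework behind the same three-method interface and replace the exact-match check with an LLM-judged correctness call.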