The AI Hippocampus: How Far are We From Human Memory?

Authors: Zixia Jia, Jiaqi Li, Yipeng Kang, Yuxuan Wang, Tong Wu, Quansen Wang, Xiaobo Wang, Shuyi Zhang, Junzhe Shen, Qing Li, Siyuan Qi, Yitao Liang, Di He, Zilong Zheng, Song-Chun Zhu

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We benchmarked a variety of memory frameworks: a no-memory baseline, a simple RAG implementation using Chroma DB, LangChain's native FAISS RAG, Haystack, LlamaIndex, Mem0 (local version), and Zep (API version). The results, presented in Table 4 and Table 5, show the correctness rate and the average total time per question (in seconds), which includes data ingestion, retrieval, and reasoning. For all frameworks except the no-memory baseline, ingestion and retrieval accounted for the majority of the processing time.
Researcher Affiliation Academia Zixia Jia 1, Jiaqi Li 1, Yipeng Kang 1, Yuxuan Wang 1, Tong Wu 1, Quansen Wang 1,2, Xiaobo Wang 1, Shuyi Zhang 1, Junzhe Shen 1, Qing Li 1, Siyuan Qi 1, Yitao Liang 2, Di He 2, Zilong Zheng 1, Song-Chun Zhu. 1 State Key Laboratory of General Artificial Intelligence, BIGAI; 2 Peking University.
Pseudocode No The paper describes various methodologies and architectures but does not include any explicit pseudocode blocks or algorithms.
Open Source Code Yes The survey's website is available at https://github.com/bigai-nlco/LLM-Memory-Survey.
Open Datasets Yes For our evaluation, we selected the longmemeval_s_cleaned dataset from the official LongMemEval benchmark (Wu et al., 2024a).
Dataset Splits Yes The full dataset comprises 2,500 questions. ... Due to the extensive average processing times of Mem0, Langchain, and Zep, we conducted their evaluations on a 10% random sample of the dataset.
Hardware Specification No The paper mentions using "Llama3-8B-IT" and "GPT-4o-mini" as reasoning engines and for evaluation, but it does not specify the underlying hardware (e.g., GPU/CPU models, memory, or cloud resources) on which these models or the benchmarked frameworks were run.
Software Dependencies No The paper mentions several frameworks and models, such as "Chroma DB", "LangChain's native FAISS RAG", "Haystack", "LlamaIndex", "Mem0", "Zep", "Llama3-8B-IT", and "GPT-4o-mini", but it does not provide specific version numbers for any of these software components, libraries, or programming languages.
Experiment Setup No The paper states, "For the reasoning engine, we employed two LLMs: Llama-3-8B-IT and GPT-4o-mini. All responses were evaluated for correctness using GPT-4o-mini." While it names the models used for evaluation, it does not provide specific hyperparameters (e.g., learning rate, batch size, temperature) or other detailed configuration settings for these models or the benchmarked frameworks that would be necessary to fully reproduce their evaluation setup.
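The per-question timing breakdown the paper reports (ingestion + retrieval + reasoning summed into a total time) can be sketched with a toy in-memory keyword retriever. This is a minimal illustration of the measurement structure only: the corpus, the token-overlap retriever, and the `answer` stub standing in for the LLM reasoning engine are all assumptions, not the paper's actual benchmark harness or any of the frameworks it evaluates.

```python
import time


def ingest(corpus):
    # Toy "ingestion": build an inverted index from lowercase tokens to doc ids.
    index = {}
    for doc_id, text in enumerate(corpus):
        for token in text.lower().split():
            index.setdefault(token, set()).add(doc_id)
    return index


def retrieve(index, corpus, query, k=2):
    # Toy retrieval: rank documents by how many query tokens they share.
    scores = {}
    for token in query.lower().split():
        for doc_id in index.get(token, ()):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    return [corpus[i] for i in ranked]


def answer(query, context):
    # Stand-in for the LLM reasoning step (hypothetical; the paper used
    # Llama3-8B-IT and GPT-4o-mini here).
    return context[0] if context else "no answer"


def timed_question(corpus, query):
    # Time each phase separately, mirroring the paper's reported breakdown.
    timings = {}
    t0 = time.perf_counter()
    index = ingest(corpus)
    timings["ingestion"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    context = retrieve(index, corpus, query)
    timings["retrieval"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    result = answer(query, context)
    timings["reasoning"] = time.perf_counter() - t0

    # Total is the sum of the three phases (computed before "total" is added).
    timings["total"] = sum(timings.values())
    return result, timings
```

With a real framework, `ingest` and `retrieve` would dominate `timings["total"]`, which is the pattern the paper observes for every framework except the no-memory baseline.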