reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

Authors: Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, Dong Yu

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We introduce LONGMEMEVAL, a comprehensive benchmark designed to evaluate five core long-term memory abilities of chat assistants: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention. With 500 meticulously curated questions embedded within freely scalable user-assistant chat histories, LONGMEMEVAL presents a significant challenge to existing long-term memory systems, with commercial chat assistants and long-context LLMs showing a 30% accuracy drop on memorizing information across sustained interactions. We then present a unified framework that breaks down the long-term memory design into three stages: indexing, retrieval, and reading. Built upon key experimental insights, we propose several memory design optimizations including session decomposition for value granularity, fact-augmented key expansion for indexing, and time-aware query expansion for refining the search scope. Extensive experiments show that these optimizations greatly improve both memory recall and downstream question answering on LONGMEMEVAL.
Researcher Affiliation	Collaboration	1UCLA, 2Tencent AI Lab Seattle, 3UC San Diego EMAIL
Pseudocode	No	The paper describes methods and processes through textual descriptions and diagrams (e.g., Figure 4), but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps formatted like code.
Open Source Code	Yes	Our benchmark and code are publicly available at https://github.com/xiaowu0162/Long Mem Eval.
Open Datasets	Yes	Our benchmark and code are publicly available at https://github.com/xiaowu0162/Long Mem Eval. ... We have created and will publicly release the two fixed evaluation datasets, LONGMEMEVALS and LONGMEMEVALM. In addition, we will also release the algorithm and source mixture used to create these two datasets, so that future studies could build upon them to create chat histories of any length. ... We draw the irrelevant sessions from two sources: (1) self-chat sessions simulated based on other non-conflicting attributes and (2) publicly released user-AI style chat data including Share GPT (Zheng et al., 2023) and Ultra Chat (Ding et al., 2023).
Dataset Splits	No	While the pipeline allows us to compile chat histories of arbitrary length, we provide two standard settings: LONGMEMEVALS (approximately 115k tokens/question) and LONGMEMEVALM (500 sessions, 1.5M tokens).
Hardware Specification	No	The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, or memory) used for running its experiments. It mentions using various LLMs such as GPT-4o and Llama models, but no underlying hardware specifications are provided.
Software Dependencies	Yes	Unless otherwise mentioned, Llama 3 70B Instruct (Dubey et al., 2024) is used as the LLM in the pipeline. ... We mainly study three LLMs: GPT-4o, Llama 3.1 70B Instruct, and Llama 3.1 8B Instruct3. For the retriever, we choose dense retrieval with the 1.5B Stella V5 model (Zhang, 2023) ... Specifically, we prompt-engineer the gpt-4o-2024-08-06 model via the Open AI API.
Experiment Setup	Yes	We mainly study three LLMs: GPT-4o, Llama 3.1 70B Instruct, and Llama 3.1 8B Instruct3. For the retriever, we choose dense retrieval with the 1.5B Stella V5 model (Zhang, 2023), given its high performance on MTEB (Muennighoff et al., 2023). For the indexing stage, we employ Llama 3.1 8B Instruct to extract summaries, keyphrases, user facts, and timestamped events. When sessions or rounds are used as the key, we only keep the user-side utterances. In the reading stage, the retrieved items are always sorted by their timestamp to help the reader model maintain temporal consistency. Throughout 5.2 to 5.4, we apply Chain-of-Note and json format (discussed in 5.5) by default.