M+: Extending MemoryLLM with Scalable Long-Term Memory

Authors: Yu Wang, Dmitry Krotov, Yuanzhe Hu, Yifan Gao, Wangchunshu Zhou, Julian McAuley, Dan Gutfreund, Rogerio Feris, Zexue He

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate M+ on diverse benchmarks, including long-context understanding and knowledge retention tasks. Experimental results show that M+ significantly outperforms MemoryLLM and recent strong baselines, extending knowledge retention from under 20k to over 160k tokens with similar GPU memory overhead." Supporting sections: 4. Experiments; 4.1. Long Book QA and Event QA; 4.1.2. Experimental Results; 4.2. GPU Cost Comparison; 4.3. Knowledge Retention Experiments; 4.5. Ablation Study.
Researcher Affiliation | Collaboration | Yu Wang (1), Dmitry Krotov (2,3), Yuanzhe Hu (1), Yifan Gao (4), Wangchunshu Zhou (5), Julian McAuley (1), Dan Gutfreund (2,3), Rogerio Feris (2,3), Zexue He (1,2,3). Affiliations: (1) UC San Diego; (2) MIT-IBM Watson Lab; (3) IBM Research; (4) Amazon; (5) OPPO. Correspondence to: Yu Wang <EMAIL>, Zexue He <EMAIL>.
Pseudocode | No | The paper describes its methods through textual explanations and diagrams (e.g., Figure 1) but does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | "We open-source our code at https://github.com/wangyu-ustc/MemoryLLM."
Open Datasets | Yes | "We evaluate M+ on diverse benchmarks, including long-context understanding and knowledge retention tasks." Long Book-QA is part of ∞-Bench (Zhang et al., 2024). Long Book Event QA is a new benchmark proposed by the authors to evaluate the model's ability to recall past events and reason chronologically. The model ϕ equipped with θ is first continually trained on fineweb-edu (Penedo et al., 2024). M+ and Llama-3.1-8B are evaluated on relatively short documents using the LongBench benchmark, with input lengths of 8k and 16k tokens; the evaluation metric is QA-F1, following Bai et al. (2023). To evaluate the ability of M+ to recall long-term knowledge, the authors follow the experimental setup of MemoryLLM (Wang et al., 2024a) on the SQuAD and NaturalQA datasets, formatted as (context, question, answer).
Dataset Splits | Yes | Documents from SlimPajama ranging from 4k to 64k tokens are split into four categories by length: 4k-8k, 8k-16k, 16k-32k, and 32k-64k (statistics of the obtained dataset are shown in Appendix C of the paper). For each length range, 200,000 examples are randomly sampled and combined with a snapshot of fineweb in equal proportions (1:1:1:1:1), with each subset contributing 20% of the total data. After filtering out ambiguous examples that gpt-4o-mini fails to answer, the first 100 examples from the remaining answerable set are selected for evaluation. The three models are evaluated on a held-out subset of SlimPajama containing 1,000 examples with lengths between 32k and 64k tokens.
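The length-bucketed sampling described above can be sketched as follows. This is a minimal illustration, not the authors' code: `bucket_documents` and the whitespace-based `length_fn` are our own stand-ins (the paper would use a real tokenizer to count tokens).

```python
from collections import defaultdict

# The four token-length ranges used for the SlimPajama split in the paper.
BUCKETS = [(4_000, 8_000), (8_000, 16_000), (16_000, 32_000), (32_000, 64_000)]

def bucket_documents(docs, length_fn):
    """Group documents into the four length ranges; docs outside 4k-64k are dropped."""
    buckets = defaultdict(list)
    for doc in docs:
        n = length_fn(doc)
        for lo, hi in BUCKETS:
            if lo <= n < hi:
                buckets[(lo, hi)].append(doc)
                break  # each document lands in exactly one bucket
    return buckets

# Toy usage: whitespace token count stands in for a real tokenizer.
docs = ["tok " * 5_000, "tok " * 12_000, "tok " * 40_000, "tok " * 100]
buckets = bucket_documents(docs, lambda d: len(d.split()))
```

From each such bucket the paper then samples 200,000 examples and mixes them 1:1:1:1:1 with a fineweb snapshot; that sampling step is omitted here.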
Hardware Specification | Yes | "We build M+ on top of Llama3.1-8B (Dubey et al., 2024) and train it using eight A100 GPUs. All experiments in this section are conducted on a single H100 GPU."
Software Dependencies | No | The paper mentions using deepspeed-stage-2, FSDP, accelerate, spaCy (for NER), and gpt-4o (for generating questions), but does not provide version numbers for these software components or libraries.
Experiment Setup | Yes | "Specifically, we set K = 256, N = 10,240 (N is the number of tokens in the short-term memory, see Section 3.1), and the number of tokens of extracted LTM in Figure 1 is set to 2,560. The generation window (i.e., the maximum length of generation) is set to 2,048." Continual training of MemoryLLM (Stage 1) runs for 1,200,000 steps over four weeks. Training then runs for one epoch, taking around one week, with the same training tasks as in Stage 1. In Stage 3, the configuration is adjusted by setting θl to 10,240 tokens and retrieving K0 = 2,560 tokens from the long-term memory, maintaining a total of 12,800 memory tokens as in the previous stages.
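The token-budget arithmetic in this setup (10,240 short-term memory tokens plus 2,560 retrieved long-term memory tokens, totaling 12,800) can be captured in a small config sketch. The field names and the `MPlusMemoryConfig` class are our own illustration, not from the paper; the values are those quoted above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MPlusMemoryConfig:
    # Values quoted from the paper's setup; field names are assumptions.
    k: int = 256              # K, as set in Section 3.1 of the paper
    stm_tokens: int = 10_240  # N: tokens in the short-term memory
    ltm_tokens: int = 2_560   # K0: tokens retrieved from long-term memory (Stage 3)
    gen_window: int = 2_048   # maximum generation length

    @property
    def total_memory_tokens(self) -> int:
        # STM + retrieved LTM gives the total memory token budget.
        return self.stm_tokens + self.ltm_tokens

cfg = MPlusMemoryConfig()
assert cfg.total_memory_tokens == 12_800  # matches the paper's "12,800 memory tokens"
```

A frozen dataclass makes the configuration immutable, which is a reasonable choice for settings that must stay fixed across training stages.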