M+: Extending MemoryLLM with Scalable Long-Term Memory

Authors: Yu Wang, Dmitry Krotov, Yuanzhe Hu, Yifan Gao, Wangchunshu Zhou, Julian McAuley, Dan Gutfreund, Rogerio Feris, Zexue He

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate M+ on diverse benchmarks, including long-context understanding and knowledge retention tasks. Experimental results show that M+ significantly outperforms MemoryLLM and recent strong baselines, extending knowledge retention from under 20k to over 160k tokens with similar GPU memory overhead." Supporting sections: 4. Experiments; 4.1. Long Book QA and Event QA; 4.1.2. Experimental Results; 4.2. GPU Cost Comparison; 4.3. Knowledge Retention Experiments; 4.5. Ablation Study.
Researcher Affiliation | Collaboration | Yu Wang (1), Dmitry Krotov (2,3), Yuanzhe Hu (1), Yifan Gao (4), Wangchunshu Zhou (5), Julian McAuley (1), Dan Gutfreund (2,3), Rogerio Feris (2,3), Zexue He (1,2,3). Affiliations: (1) UC San Diego; (2) MIT-IBM Watson Lab; (3) IBM Research; (4) Amazon; (5) OPPO. Correspondence to: Yu Wang <EMAIL>, Zexue He <EMAIL>.
Pseudocode | No | The paper describes its methods through textual explanations and diagrams (e.g., Figure 1) but does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | "We open-source our code at https://github.com/wangyu-ustc/MemoryLLM."
Open Datasets | Yes | "We evaluate M+ on diverse benchmarks, including long-context understanding and knowledge retention tasks." Long Book-QA is part of ∞-Bench (Zhang et al., 2024). Long Book Event QA is a new benchmark proposed by the authors to evaluate the model's ability to recall past events and reason chronologically. The model ϕ equipped with θ is first continually trained on fineweb-edu (Penedo et al., 2024). M+ and Llama-3.1-8B are evaluated on relatively short documents using the LongBench benchmark, with input lengths of 8k and 16k tokens; the evaluation metric is QA-F1, following Bai et al. (2023). To evaluate the ability of M+ to recall long-term knowledge, the authors follow the experimental setup of MemoryLLM (Wang et al., 2024a) on the SQuAD and NaturalQA datasets, formatted as (context, question, answer).
Dataset Splits | Yes | Documents from SlimPajama ranging from 4k to 64k tokens are split into four categories by length: 4k-8k, 8k-16k, 16k-32k, and 32k-64k (statistics of the obtained dataset are shown in Appendix C of the paper). For each length range, 200,000 examples are randomly sampled and combined with a snapshot of fineweb in equal proportions (1:1:1:1:1), with each subset contributing 20% of the total data. After filtering out ambiguous examples that gpt-4o-mini fails to answer, the first 100 examples from the remaining answerable set are selected for evaluation. The three models are evaluated on a held-out subset of SlimPajama containing 1,000 examples with lengths between 32k and 64k tokens.
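The length-bucketed sampling described above can be sketched as follows. This is a minimal illustration, not the authors' code: `bucket_documents` and the whitespace-based `length_fn` are our own stand-ins (the paper would use a real tokenizer to count tokens).

```python
from collections import defaultdict

# The four token-length ranges used for the SlimPajama split in the paper.
BUCKETS = [(4_000, 8_000), (8_000, 16_000), (16_000, 32_000), (32_000, 64_000)]

def bucket_documents(docs, length_fn):
    """Group documents into the four length ranges; docs outside 4k-64k are dropped."""
    buckets = defaultdict(list)
    for doc in docs:
        n = length_fn(doc)
        for lo, hi in BUCKETS:
            if lo <= n < hi:
                buckets[(lo, hi)].append(doc)
                break  # each document lands in exactly one bucket
    return buckets

# Toy usage: whitespace token count stands in for a real tokenizer.
docs = ["tok " * 5_000, "tok " * 12_000, "tok " * 40_000, "tok " * 100]
buckets = bucket_documents(docs, lambda d: len(d.split()))
```

From each such bucket the paper then samples 200,000 examples and mixes them 1:1:1:1:1 with a fineweb snapshot; that sampling step is omitted here.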
Hardware Specification | Yes | "We build M+ on top of Llama3.1-8B (Dubey et al., 2024) and train it using eight A100 GPUs. All experiments in this section are conducted on a single H100 GPU."
Software Dependencies | No | The paper mentions using deepspeed-stage-2, FSDP, accelerate, spaCy (for NER), and gpt-4o (for generating questions), but does not provide version numbers for these software components or libraries.
Experiment Setup | Yes | "Specifically, we set K = 256, N = 10,240 (N is the number of tokens in the short-term memory, see Section 3.1), and the number of tokens of extracted LTM in Figure 1 is set to 2,560. The generation window (i.e., the maximum length of generation) is set to 2,048." Continual training of MemoryLLM (Stage 1) runs for 1,200,000 steps over four weeks. Training then runs for one epoch, taking around one week, with the same training tasks as in Stage 1. In Stage 3, the configuration is adjusted by setting θl to 10,240 tokens and retrieving K0 = 2,560 tokens from the long-term memory, maintaining a total of 12,800 memory tokens as in the previous stages.
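The token-budget arithmetic in this setup (10,240 short-term memory tokens plus 2,560 retrieved long-term memory tokens, totaling 12,800) can be captured in a small config sketch. The field names and the `MPlusMemoryConfig` class are our own illustration, not from the paper; the values are those quoted above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MPlusMemoryConfig:
    # Values quoted from the paper's setup; field names are assumptions.
    k: int = 256              # K, as set in Section 3.1 of the paper
    stm_tokens: int = 10_240  # N: tokens in the short-term memory
    ltm_tokens: int = 2_560   # K0: tokens retrieved from long-term memory (Stage 3)
    gen_window: int = 2_048   # maximum generation length

    @property
    def total_memory_tokens(self) -> int:
        # STM + retrieved LTM gives the total memory token budget.
        return self.stm_tokens + self.ltm_tokens

cfg = MPlusMemoryConfig()
assert cfg.total_memory_tokens == 12_800  # matches the paper's "12,800 memory tokens"
```

A frozen dataclass makes the configuration immutable, which is a reasonable choice for settings that must stay fixed across training stages.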