Memory Layers at Scale
Authors: Vincent-Pierre Berges, Barlas Oguz, Daniel Haziza, Wen-Tau Yih, Luke Zettlemoyer, Gargi Ghosh
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results (Section 4) indicate that memory layers improve the factual accuracy of language models by over 100% as measured by factual QA benchmarks, while also improving significantly on coding (HumanEval, MBPP) and general knowledge (HellaSwag, MMLU). |
| Researcher Affiliation | Industry | Meta FAIR. Correspondence to: Vincent-Pierre Berges <EMAIL>, Barlas Oguz <EMAIL>. |
| Pseudocode | No | The paper describes methods using mathematical equations (e.g., Equation (1), Equation (2)) and architectural diagrams (e.g., Figure 2, Figure 3), but does not contain explicitly labeled 'Pseudocode' or 'Algorithm' sections, nor structured code-like formatted procedures. |
| Open Source Code | Yes | Our implementation is available at https://github.com/facebookresearch/memory |
| Open Datasets | Yes | Our evaluations cover factual question answering (Natural Questions (Kwiatkowski et al., 2019), TriviaQA (Joshi et al., 2017)), multi-hop question answering (HotpotQA (Yang et al., 2018)), scientific and common sense world knowledge (MMLU (Hendrycks et al., 2021), HellaSwag (Zellers et al., 2019), OBQA (Mihaylov et al., 2018), PIQA (Bisk et al., 2019)) and coding (HumanEval (Chen et al., 2021), MBPP (Austin et al., 2021)). |
| Dataset Splits | No | The paper evaluates on well-known benchmarks such as Natural Questions, TriviaQA, and MMLU, which typically have predefined splits. However, the paper does not explicitly state the training/validation/test splits (e.g., percentages or sample counts) used, nor does it cite sources that specify the splits used for its evaluation. |
| Hardware Specification | Yes | Our forward pass optimizes memory accesses and achieves 3 TB/s of memory bandwidth, which is close to our H100 specification of 3.35 TB/s (compared to less than 400 GB/s with PyTorch's implementation). |
| Software Dependencies | No | The paper mentions using PyTorch's EmbeddingBag operation and custom CUDA kernels for implementation, but does not specify version numbers for PyTorch, CUDA, or any other software dependencies, making it difficult to precisely reproduce the software environment. |
| Experiment Setup | Yes | Appendix A provides a table detailing 'Model Configurations' including 'Model Size', 'Embedding Dim.', 'Number of Layers', 'Attention Heads', and 'Learning Rate' for different base model sizes (134m, 373m, 720m, 1.3b, 8b). It also specifies that 'Memory and Memory+ experiments use 4 heads and 32 top-k values for the memory embedding lookups'. |
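The lookup the table refers to (top-k key selection over a large memory, followed by a softmax-weighted sum of value embeddings, which the paper implements via PyTorch's EmbeddingBag) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the array names, sizes, and the single-head simplification here are illustrative assumptions.

```python
import numpy as np

def memory_lookup(query, keys, values, top_k=32):
    """Illustrative sparse memory lookup (single head).

    Scores every memory key against the query, keeps the top-k slots,
    and returns a softmax-weighted sum of their value embeddings --
    the same shape of computation an EmbeddingBag-style gather performs.
    """
    scores = keys @ query                            # (num_slots,) similarity scores
    idx = np.argpartition(scores, -top_k)[-top_k:]   # indices of the top-k keys
    w = np.exp(scores[idx] - scores[idx].max())      # numerically stable softmax
    w /= w.sum()
    return w @ values[idx]                           # (value_dim,) weighted sum

# Toy example: 1024 memory slots, 16-dim keys, 8-dim values.
rng = np.random.default_rng(0)
q = rng.standard_normal(16)
K = rng.standard_normal((1024, 16))
V = rng.standard_normal((1024, 8))
out = memory_lookup(q, K, V, top_k=32)
```

The paper's configuration (4 heads, top-32) would run this lookup once per head with separate queries and concatenate or sum the results; product-key factorization, which makes the top-k search tractable at scale, is omitted here for brevity.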