Memory Layers at Scale
Authors: Vincent-Pierre Berges, Barlas Oguz, Daniel Haziza, Wen-Tau Yih, Luke Zettlemoyer, Gargi Ghosh
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results (Section 4) indicate that memory layers improve the factual accuracy of language models by over 100% as measured by factual QA benchmarks, while also improving significantly on coding (HumanEval, MBPP) and general knowledge (HellaSwag, MMLU). |
| Researcher Affiliation | Industry | Meta FAIR. Correspondence to: Vincent-Pierre Berges <EMAIL>, Barlas Oguz <EMAIL>. |
| Pseudocode | No | The paper describes methods using mathematical equations (e.g., Equation (1), Equation (2)) and architectural diagrams (e.g., Figure 2, Figure 3), but does not contain explicitly labeled 'Pseudocode' or 'Algorithm' sections, nor structured code-like formatted procedures. |
| Open Source Code | Yes | Our implementation is available at https://github.com/facebookresearch/memory |
| Open Datasets | Yes | Our evaluations cover factual question answering (Natural Questions (Kwiatkowski et al., 2019), TriviaQA (Joshi et al., 2017)), multi-hop question answering (HotpotQA (Yang et al., 2018)), scientific and common sense world knowledge (MMLU (Hendrycks et al., 2021), HellaSwag (Zellers et al., 2019), OBQA (Mihaylov et al., 2018), PIQA (Bisk et al., 2019)) and coding (HumanEval (Chen et al., 2021), MBPP (Austin et al., 2021)). |
| Dataset Splits | No | The paper evaluates on well-known benchmarks such as Natural Questions, TriviaQA, and MMLU, which typically have predefined splits. However, the paper does not explicitly state the training/validation/test splits (e.g., percentages or sample counts) used, nor does it cite sources that specify the splits used for its evaluation. |
| Hardware Specification | Yes | Our forward pass optimizes memory accesses and achieves 3 TB/s of memory bandwidth, which is close to our H100 specification of 3.35 TB/s (compared to less than 400 GB/s with PyTorch's implementation). |
| Software Dependencies | No | The paper mentions using PyTorch's EmbeddingBag operation and custom CUDA kernels for implementation, but does not specify version numbers for PyTorch, CUDA, or any other software dependencies, making it difficult to precisely reproduce the software environment. |
| Experiment Setup | Yes | Appendix A provides a table detailing 'Model Configurations' including 'Model Size', 'Embedding Dim.', 'Number of Layers', 'Attention Heads', and 'Learning Rate' for different base model sizes (134m, 373m, 720m, 1.3b, 8b). It also specifies that 'Memory and Memory+ experiments use 4 heads and 32 top-k values for the memory embedding lookups'. |
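The lookup the table refers to (top-k key selection over a large memory, followed by a softmax-weighted sum of value embeddings, which the paper implements via PyTorch's EmbeddingBag) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the array names, sizes, and the single-head simplification here are illustrative assumptions.

```python
import numpy as np

def memory_lookup(query, keys, values, top_k=32):
    """Illustrative sparse memory lookup (single head).

    Scores every memory key against the query, keeps the top-k slots,
    and returns a softmax-weighted sum of their value embeddings --
    the same shape of computation an EmbeddingBag-style gather performs.
    """
    scores = keys @ query                            # (num_slots,) similarity scores
    idx = np.argpartition(scores, -top_k)[-top_k:]   # indices of the top-k keys
    w = np.exp(scores[idx] - scores[idx].max())      # numerically stable softmax
    w /= w.sum()
    return w @ values[idx]                           # (value_dim,) weighted sum

# Toy example: 1024 memory slots, 16-dim keys, 8-dim values.
rng = np.random.default_rng(0)
q = rng.standard_normal(16)
K = rng.standard_normal((1024, 16))
V = rng.standard_normal((1024, 8))
out = memory_lookup(q, K, V, top_k=32)
```

The paper's configuration (4 heads, top-32) would run this lookup once per head with separate queries and concatenate or sum the results; product-key factorization, which makes the top-k search tractable at scale, is omitted here for brevity.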