MemLLM: Finetuning LLMs to Use Explicit Read-Write Memory
Authors: Ali Modarressi, Abdullatif Köksal, Ayyoob Imani, Mohsen Fayyaz, Hinrich Schuetze
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments indicate that MemLLM enhances the LLM's performance and interpretability, in language modeling in general and knowledge-intensive tasks in particular. Our evaluation on Re-DocRED (Tan et al., 2022) demonstrates that MemLLM achieves better perplexity compared to baselines without memory components, with strong gains on named entities. We also show that MemLLM outperforms non-memory-based methods on knowledge editing. |
| Researcher Affiliation | Collaboration | (1) Center for Information and Language Processing, LMU Munich, Germany; (2) Munich Center for Machine Learning, Germany; (3) Microsoft, Berlin, Germany |
| Pseudocode | Yes | Algorithm 1 presents the pseudocode for the process of generating MemLLM's memory-read training data. See Section 3.3 for a detailed description of the same process. |
| Open Source Code | Yes | The project repository is publicly available at: https://github.com/amodaresi/MemLLM |
| Open Datasets | Yes | We use three such datasets. (i) Re-DocRED (Tan et al., 2022): Wikipedia texts annotated (in a Wikidata format)... (ii) DocRED's distantly supervised training set... (iii) A set of counterfactual variations of Re-DocRED (Modarressi et al., 2024)... Our primary source is a full dump of English Wikipedia... available at: https://huggingface.co/datasets/wikimedia/wikipedia |
| Dataset Splits | Yes | We select 1000 examples from the human-annotated split of DocRED as positive examples where the focus sentence is annotated as evidence. For negative examples, we choose 1000 examples where the focus sentence contains at least one entity but there is no evidence for the relation in the focus sentence. |
| Hardware Specification | No | The paper mentions finetuning with a Mistral-7B-v0.1 model but does not specify any hardware details like GPU/CPU models, memory, or cloud instances used for running the experiments. |
| Software Dependencies | No | The paper mentions using a Mistral-7B-v0.1 model, Adam optimizer, and LoRA, but does not provide specific version numbers for software libraries or dependencies like Python, PyTorch, or the Hugging Face Transformers library. |
| Experiment Setup | Yes | We finetune MemLLM, with a Mistral-7B-v0.1 model (Jiang et al., 2023) using an Adam optimizer (Kingma & Ba, 2015), with the learning rate set to 2 × 10⁻⁵, 2 epochs, and a batch size of 96. For LoRA-specific parameters, we apply a dropout rate of 0.1, with a rank of 16 and an alpha weight of 8. We opted to set Qthr to 30... We set τe and τt to 0.7 and τr to 0.85. We set these values to τe = 0.85, τt = 0.2 and τr = 0.6 respectively for model editing experiments. |
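The hyperparameters quoted in the Experiment Setup row can be collected into a single sketch for anyone attempting a reproduction. The dictionary below only restates the values reported in the paper; the key names and the helper function are illustrative, not taken from the MemLLM codebase (see the repository linked above for the authors' actual configuration):

```python
# Hedged sketch of the reported finetuning setup. Key names are
# illustrative; consult https://github.com/amodaresi/MemLLM for the
# authors' actual configuration files.

FINETUNE_CONFIG = {
    "base_model": "mistralai/Mistral-7B-v0.1",
    "optimizer": "adam",
    "learning_rate": 2e-5,
    "epochs": 2,
    "batch_size": 96,
    # LoRA-specific parameters
    "lora_dropout": 0.1,
    "lora_rank": 16,
    "lora_alpha": 8,
    # Thresholds reported for the language-modeling experiments
    "q_thr": 30,
    "tau_e": 0.7,
    "tau_t": 0.7,
    "tau_r": 0.85,
}

# The paper overrides the tau thresholds for model-editing experiments.
EDITING_OVERRIDES = {"tau_e": 0.85, "tau_t": 0.2, "tau_r": 0.6}


def config_for(task: str) -> dict:
    """Return the reported hyperparameters for a given experiment type."""
    cfg = dict(FINETUNE_CONFIG)
    if task == "editing":
        cfg.update(EDITING_OVERRIDES)
    return cfg
```

For example, `config_for("editing")["tau_t"]` yields 0.2, while `config_for("language_modeling")` keeps the 0.7/0.7/0.85 thresholds.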