Minerva: A Programmable Memory Test Benchmark for Language Models

Authors: Menglin Xia, Victor Rühle, Saravan Rajmohan, Reza Shokri

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We run a comprehensive evaluation of several major open-source and black-box models (e.g., GPT-4(o), Cohere, Gemma, LLaMA, Mistral, Phi). Our experimental results show that while models perform relatively well on simple search tasks, they exhibit significant disparities across context utilization capabilities even at a context length of 4k tokens.
Researcher Affiliation | Collaboration | M365 Research, Microsoft; National University of Singapore. Correspondence to: Menglin Xia <EMAIL>, Reza Shokri <EMAIL>.
Pseudocode | No | The paper includes a list of 'Test Templates' in Appendix A, which are prompt structures for generating test cases. However, these are examples of input prompts for the models being tested, not pseudocode or algorithm blocks for the methodology of the Minerva framework itself. There is no section explicitly labeled 'Pseudocode' or 'Algorithm'.
Open Source Code | Yes | The code and data will be available at https://github.com/microsoft/minerva_memory_test.
Open Datasets | Yes | The code and data will be available at https://github.com/microsoft/minerva_memory_test.
Dataset Splits | Yes | We use the proposed framework to evaluate nine widely used language models on a fixed snapshot of 1110 randomly generated test samples. For all tests, we fixed the context length to 4k tokens, except in the Stateful Processing category, where the context length depends on the number of operation steps. We set the number of steps as 200 for quantity state and 100 for set state, corresponding to an approximate context length of 1.5k tokens. Further details on the number of examples, hyperparameter configurations, and evaluation metrics for the tests are provided in Appendices B and C.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types) used for running the experiments or evaluations.
Software Dependencies | No | The paper mentions ROUGE-L (Lin, 2004) and Jaccard similarity (Jaccard, 1901) as evaluation metrics but does not specify any software libraries or their version numbers used for implementation or other parts of the methodology.
Experiment Setup | Yes | For all tests, we fixed the context length to 4k tokens, except in the Stateful Processing category, where the context length depends on the number of operation steps. We set the number of steps as 200 for quantity state and 100 for set state, corresponding to an approximate context length of 1.5k tokens. For evaluation, we use exact match accuracy for binary tasks, ROUGE-L (Lin, 2004) for tests that require sequence overlap measurement, and Jaccard similarity (Jaccard, 1901) for set overlap. We set the max output tokens to 4096, temperature to 0, and top-p to 1 for all model inference.
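The Jaccard similarity used for set-overlap scoring has a standard definition: the size of the intersection divided by the size of the union of two sets. A minimal sketch in Python (the function name and example sets are illustrative, not taken from the paper's code):

```python
def jaccard_similarity(predicted, reference):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two sets.

    Returns 1.0 when both sets are empty (perfect agreement by convention).
    """
    a, b = set(predicted), set(reference)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# One shared element out of three distinct elements → 1/3.
print(jaccard_similarity({"apple", "pear"}, {"apple", "plum"}))
```

Exact match for binary tasks reduces to simple string equality after normalization, so Jaccard is the only metric here with a non-trivial formula.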