Minerva: A Programmable Memory Test Benchmark for Language Models

Authors: Menglin Xia, Victor Rühle, Saravan Rajmohan, Reza Shokri

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We run a comprehensive evaluation of several major open-source and black-box models (e.g., GPT-4(o), Cohere, Gemma, LLaMA, Mistral, Phi). Our experimental results show that while models perform relatively well on simple search tasks, they exhibit significant disparities across context utilization capabilities even at a context length of 4k tokens.
Researcher Affiliation | Collaboration | M365 Research, Microsoft; National University of Singapore. Correspondence to: Menglin Xia <EMAIL>, Reza Shokri <EMAIL>.
Pseudocode | No | The paper includes a list of 'Test Templates' in Appendix A, which are prompt structures for generating test cases. However, these are examples of input prompts for the models being tested, not pseudocode or algorithm blocks for the methodology of the Minerva framework itself. There is no section explicitly labeled 'Pseudocode' or 'Algorithm'.
Open Source Code | Yes | The code and data will be available at https://github.com/microsoft/minerva_memory_test.
Open Datasets | Yes | The code and data will be available at https://github.com/microsoft/minerva_memory_test.
Dataset Splits | Yes | We use the proposed framework to evaluate nine widely used language models on a fixed snapshot of 1110 randomly generated test samples. For all tests, we fixed the context length to 4k tokens, except in the Stateful Processing category, where the context length depends on the number of operation steps. We set the number of steps as 200 for quantity state and 100 for set state, corresponding to an approximate context length of 1.5k tokens. Further details on the number of examples, hyperparameter configurations, and evaluation metrics for the tests are provided in Appendices B and C.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types) used for running the experiments or evaluations.
Software Dependencies | No | The paper mentions ROUGE-L (Lin, 2004) and Jaccard similarity (Jaccard, 1901) as evaluation metrics but does not specify any software libraries or their version numbers used for implementation or other parts of the methodology.
Experiment Setup | Yes | For all tests, we fixed the context length to 4k tokens, except in the Stateful Processing category, where the context length depends on the number of operation steps. We set the number of steps as 200 for quantity state and 100 for set state, corresponding to an approximate context length of 1.5k tokens. For evaluation, we use exact match accuracy for binary tasks, ROUGE-L (Lin, 2004) for tests that require sequence overlap measurement, and Jaccard similarity (Jaccard, 1901) for set overlap. We set the max output tokens to 4096, temperature to 0, and top-p to 1 for all model inference.
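The Jaccard similarity used for set-overlap scoring has a standard definition: the size of the intersection divided by the size of the union of two sets. A minimal sketch in Python (the function name and example sets are illustrative, not taken from the paper's code):

```python
def jaccard_similarity(predicted, reference):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two sets.

    Returns 1.0 when both sets are empty (perfect agreement by convention).
    """
    a, b = set(predicted), set(reference)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# One shared element out of three distinct elements → 1/3.
print(jaccard_similarity({"apple", "pear"}, {"apple", "plum"}))
```

Exact match for binary tasks reduces to simple string equality after normalization, so Jaccard is the only metric here with a non-trivial formula.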