Episodic Memories Generation and Evaluation Benchmark for Large Language Models

Authors: Alexis Huet, Zied Houidi, Dario Rossi

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our evaluation of state-of-the-art models, including GPT-4 and Claude variants, Llama 3.1, and o1-mini, reveals that even the most advanced LLMs struggle with episodic memory tasks.
Researcher Affiliation | Industry | Alexis Huet, Zied Ben Houidi, Dario Rossi; Huawei Technologies Co., Ltd., Paris, France; {first(.mid).last}@huawei.com
Pseudocode | No | The paper describes methods and models conceptually (e.g., "We model cue-based recall as a key-value retrieval system"), but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps.
Open Source Code | Yes | We synthesize a unique episodic memory benchmark, free from contamination, and release open source code and datasets to assess LLM performance across various recall and episodic reasoning tasks. (...) Code and data available at Huet et al. (2025).
Open Datasets | Yes | We synthesize a unique episodic memory benchmark, free from contamination, and release open source code and datasets to assess LLM performance across various recall and episodic reasoning tasks. (...) We introduce a framework for generating synthetic episodic memory datasets comprising narratives of events and corresponding question-answer pairs that could also be used to generate synthetic tasks for training purposes. We further release 11 datasets, differing in size and diversity, to evaluate LLM performance across various episodic memory tasks.
Dataset Splits | Yes | The fine-tuning process thus uses 3,199 training questions for the long book (filtered from the total 3,886 questions involving one or several events, as mentioned in Sec. B.2.2) and 468 training questions for the short book. (...) All questions involving a single chapter (i.e., corresponding to the bin {1}) are present in both the training and the test sets, while all other questions appear only in the test set.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments. It mentions 'Fine-tuning using the OpenAI API', which implies a cloud service, but no underlying hardware specifications are given.
Software Dependencies | No | The paper mentions 'text-embedding-3-small' and the 'OpenAI API' for fine-tuning, but does not provide specific version numbers for these or any other software libraries or frameworks used in their implementation.
Experiment Setup | Yes | Lastly, (3) we fine-tune models using all single-event question-answer pairs as training data (details in Appendix B.2.5). (...) Fine-tuning using the OpenAI API over 30 epochs, a batch size of 64, and a learning rate multiplier of 1.8.
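The experiment-setup row quotes a concrete fine-tuning configuration (30 epochs, batch size 64, learning rate multiplier 1.8) submitted through the OpenAI API. A minimal sketch of such a job submission is shown below; the training-file identifier and base-model name are illustrative assumptions, not values taken from the paper.

```python
# Hyperparameters as reported in the paper's experiment setup.
HYPERPARAMS = {
    "n_epochs": 30,
    "batch_size": 64,
    "learning_rate_multiplier": 1.8,
}


def launch_finetune(training_file_id: str, model: str):
    """Submit a fine-tuning job with the paper's reported settings.

    `training_file_id` is the ID of a previously uploaded JSONL file of
    question-answer training pairs; `model` is the base model to tune.
    Both arguments are placeholders here, not values from the paper.
    """
    from openai import OpenAI  # requires the `openai` package and an API key

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    return client.fine_tuning.jobs.create(
        training_file=training_file_id,
        model=model,
        hyperparameters=HYPERPARAMS,
    )


if __name__ == "__main__":
    print(HYPERPARAMS)
```

The hyperparameters dictionary maps directly onto the values quoted from the paper; everything else (file ID, model choice) would need to be filled in from the released code and data.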