Episodic Memories Generation and Evaluation Benchmark for Large Language Models

Authors: Alexis Huet, Zied Houidi, Dario Rossi

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our evaluation of state-of-the-art models, including GPT-4 and Claude variants, Llama 3.1, and o1-mini, reveals that even the most advanced LLMs struggle with episodic memory tasks.
Researcher Affiliation | Industry | Alexis Huet, Zied Ben Houidi, Dario Rossi; Huawei Technologies Co., Ltd., Paris, France; {first(.mid).last}@huawei.com
Pseudocode | No | The paper describes methods and models conceptually (e.g., "We model cue-based recall as a key-value retrieval system"), but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps.
Open Source Code | Yes | We synthesize a unique episodic memory benchmark, free from contamination, and release open source code and datasets to assess LLM performance across various recall and episodic reasoning tasks. (...) Code and data available at Huet et al. (2025).
Open Datasets | Yes | We synthesize a unique episodic memory benchmark, free from contamination, and release open source code and datasets to assess LLM performance across various recall and episodic reasoning tasks. (...) We introduce a framework for generating synthetic episodic memory datasets comprising narratives of events and corresponding question-answer pairs that could also be used to generate synthetic tasks for training purposes. We further release 11 datasets, differing in size and diversity, to evaluate LLM performance across various episodic memory tasks.
Dataset Splits | Yes | The fine-tuning process thus uses 3,199 training questions for the long book (filtered from the total 3,886 questions involving one or several events, as mentioned in Sec. B.2.2) and 468 training questions for the short book. (...) All questions involving a single chapter (i.e., corresponding to the bin {1}) are present in both the training and the test sets, while all other questions appear only in the test set.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments. It mentions 'Fine-tuning using the OpenAI API', which implies a cloud service, but no underlying hardware specifications are given.
Software Dependencies | No | The paper mentions 'text-embedding-3-small' and the 'OpenAI API' for fine-tuning, but does not provide specific version numbers for these or any other software libraries or frameworks used in their implementation.
Experiment Setup | Yes | Lastly, (3) we fine-tune models using all single-event question-answer pairs as training data (details in Appendix B.2.5). (...) Fine-tuning using the OpenAI API over 30 epochs, a batch size of 64, and a learning rate multiplier of 1.8.
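The experiment-setup row quotes a concrete fine-tuning configuration (30 epochs, batch size 64, learning rate multiplier 1.8) submitted through the OpenAI API. A minimal sketch of such a job submission is shown below; the training-file identifier and base-model name are illustrative assumptions, not values taken from the paper.

```python
# Hyperparameters as reported in the paper's experiment setup.
HYPERPARAMS = {
    "n_epochs": 30,
    "batch_size": 64,
    "learning_rate_multiplier": 1.8,
}


def launch_finetune(training_file_id: str, model: str):
    """Submit a fine-tuning job with the paper's reported settings.

    `training_file_id` is the ID of a previously uploaded JSONL file of
    question-answer training pairs; `model` is the base model to tune.
    Both arguments are placeholders here, not values from the paper.
    """
    from openai import OpenAI  # requires the `openai` package and an API key

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    return client.fine_tuning.jobs.create(
        training_file=training_file_id,
        model=model,
        hyperparameters=HYPERPARAMS,
    )


if __name__ == "__main__":
    print(HYPERPARAMS)
```

The hyperparameters dictionary maps directly onto the values quoted from the paper; everything else (file ID, model choice) would need to be filled in from the released code and data.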