Episodic Memories Generation and Evaluation Benchmark for Large Language Models
Authors: Alexis Huet, Zied Houidi, Dario Rossi
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation of state-of-the-art models, including GPT-4 and Claude variants, Llama 3.1, and o1-mini, reveals that even the most advanced LLMs struggle with episodic memory tasks |
| Researcher Affiliation | Industry | Alexis Huet, Zied Ben Houidi, Dario Rossi (Huawei Technologies Co., Ltd., Paris, France); {first(.mid).last}@huawei.com |
| Pseudocode | No | The paper describes methods and models conceptually (e.g., "We model cue-based recall as a key-value retrieval system"), but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps. |
| Open Source Code | Yes | We synthesize a unique episodic memory benchmark, free from contamination, and release open source code and datasets to assess LLM performance across various recall and episodic reasoning tasks. (...) Code and data available at Huet et al. (2025). |
| Open Datasets | Yes | We synthesize a unique episodic memory benchmark, free from contamination, and release open source code and datasets to assess LLM performance across various recall and episodic reasoning tasks. (...) We introduce a framework for generating synthetic episodic memory datasets comprising narratives of events and corresponding question-answer pairs that could also be used to generate synthetic tasks for training purposes. We further release 11 datasets, differing in size and diversity, to evaluate LLM performance across various episodic memory tasks. |
| Dataset Splits | Yes | The fine-tuning process thus uses 3,199 training questions for the long book (filtered from the total 3,886 questions involving one or several events, as mentioned in Sec. B.2.2) and 468 training questions for the short book. (...) all questions involving a single chapter (i.e., corresponding to the bin {1}) are present in both the training and the test sets, while all other questions appear only in the test set. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments. It mentions 'Fine-tuning using the Open AI API', which implies a cloud service, but no underlying hardware specifications are given. |
| Software Dependencies | No | The paper mentions 'text-embedding-3-small' and 'Open AI API' for fine-tuning, but does not provide specific version numbers for these or any other software libraries or frameworks used in their implementation. |
| Experiment Setup | Yes | Lastly, (3) we fine-tune models using all single-event question-answer pairs as training data (details in Appendix B.2.5). (...) Fine-tuning using the Open AI API over 30 epochs, a batch size of 64 and a learning rate multiplier of 1.8. |
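The reported setup (30 epochs, batch size 64, learning-rate multiplier 1.8) maps directly onto the hyperparameters accepted by the OpenAI fine-tuning API. A minimal sketch of how such a request body could be assembled, assuming a placeholder training-file ID and base model name (neither is specified in the paper); no API call is made:

```python
# Hyperparameters as reported in the paper's experiment setup.
HYPERPARAMETERS = {
    "n_epochs": 30,
    "batch_size": 64,
    "learning_rate_multiplier": 1.8,
}


def build_finetune_request(training_file_id: str, base_model: str) -> dict:
    """Assemble the request body for a fine-tuning job (dry run, no network).

    `training_file_id` and `base_model` are hypothetical placeholders;
    the paper does not name the exact file or base model identifiers.
    """
    return {
        "training_file": training_file_id,
        "model": base_model,
        "hyperparameters": dict(HYPERPARAMETERS),
    }


request = build_finetune_request("file-XXXX", "base-model-placeholder")
print(request["hyperparameters"])
```

This would then be passed to the fine-tuning job creation endpoint; keeping the payload construction separate from the API call makes the configuration easy to log and compare across the long-book (3,199 training questions) and short-book (468 training questions) runs.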