Human-inspired Episodic Memory for Infinite Context LLMs
Authors: Zafeirios Fountas, Martin A Benfeghoul, Adnan Oomerjee, Fenia Christopoulou, Gerasimos Lampouras, Haitham Bou Ammar, Jun Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on the LongBench and ∞-Bench benchmarks demonstrate EM-LLM's superior performance, consistently outperforming the state-of-the-art retrieval model InfLLM across various baseline LLMs. In addition, EM-LLM outperforms its popular counterpart, RAG, in a wide range of tasks, while requiring similar resources. Notably, EM-LLM's performance even surpasses full-context models in most tasks, while successfully performing retrieval across 10 million tokens, a scale computationally infeasible for such models. |
| Researcher Affiliation | Collaboration | 1Huawei Noah's Ark Lab, London, UK 2AI Centre, Department of Computer Science, University College London, London, UK |
| Pseudocode | Yes | Algorithm 1: Event segmentation in KV cache. Input: tok (list of tokens in the sequence), T (surprisal threshold for identifying initial boundaries), f (metric function to evaluate potential boundaries). Output: B (list of final boundary positions). 1: B ← [i for i in range(len(tok)) if −log P(tok[i]) > T] (boundary identification); 2: for i in range(len(B) − 1) do; 3: α, β ← B[i], B[i+1]; 4: B[i+1] ← arg max_{β̂ ∈ (α, β]} f(A, {α, β̂}) (boundary refinement); 5: end for; 6: return B |
| Open Source Code | Yes | Code available at: https://github.com/em-llm/EM-LLM-model |
| Open Datasets | Yes | Experiments on the LongBench and ∞-Bench benchmarks demonstrate EM-LLM's superior performance... To further prove our hypotheses, we then employ a series of human-annotated podcast scripts... Finally, using the long-context PG-19 dataset (Rae et al., 2020)... |
| Dataset Splits | No | The paper evaluates its performance on the LongBench (Bai et al., 2023) and ∞-Bench (Zhang et al., 2024) benchmarks. While these are established benchmarks, the paper does not explicitly state the specific training, validation, or test dataset splits used for its experiments within the main text. |
| Hardware Specification | Yes | All of our experiments were run on single nodes of 4 GPUs, each with 32GB of dedicated memory (except for the full-context results for which we used an API). Additionally, each node had a minimum of 100GB of CPU memory. |
| Software Dependencies | No | The paper mentions using the Hugging Face Transformers library and Hugging Face's Accelerate for implementation, but it does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | Equation 1 introduces the surprise threshold parameter γ... We initially explored our approach's sensitivity to γ using Mistral on the LongBench benchmark. We evaluated the benchmark using surprise-only segmentation with γ ∈ {1.0, 1.5, 2.0, 2.5, 3.0, 3.5}... We chose n = 1 for all experiments... Overall, there was a slight preference for kr = 0.3, which we therefore selected for the rest of our experiments. |
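The two stages of Algorithm 1 can be sketched in plain Python. This is a hypothetical illustration, not the authors' released implementation: `identify_boundaries` uses a moving mean-plus-γ·σ surprisal threshold (the general form implied by the γ parameter of Equation 1; the `window` size and exact windowing are assumptions), and `refine_boundaries` takes a caller-supplied `metric` standing in for the paper's graph-theoretic boundary metric f over an adjacency matrix A.

```python
import math

def identify_boundaries(neg_log_probs, gamma=2.0, window=64):
    """Mark token i as an event boundary when its surprisal
    -log P(tok[i]) exceeds mu + gamma * sigma of the preceding
    `window` surprisal values. (Sketch; threshold form assumed.)"""
    boundaries = []
    for i, s in enumerate(neg_log_probs):
        ctx = neg_log_probs[max(0, i - window):i] or [s]
        mu = sum(ctx) / len(ctx)
        sigma = math.sqrt(sum((x - mu) ** 2 for x in ctx) / len(ctx))
        if s > mu + gamma * sigma:
            boundaries.append(i)
    return boundaries

def refine_boundaries(boundaries, adjacency, metric):
    """For each pair of neighbouring boundaries (alpha, beta], move the
    right boundary to the position maximising metric(adjacency, alpha, b),
    mirroring the arg-max refinement step of Algorithm 1."""
    refined = list(boundaries)
    for i in range(len(refined) - 1):
        alpha, beta = refined[i], refined[i + 1]
        refined[i + 1] = max(range(alpha + 1, beta + 1),
                             key=lambda b: metric(adjacency, alpha, b))
    return refined
```

A single surprisal spike over a flat background produces a single boundary; the refinement step then shifts boundaries to wherever the supplied metric peaks, independently of how they were first detected.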