Human-inspired Episodic Memory for Infinite Context LLMs

Authors: Zafeirios Fountas, Martin A Benfeghoul, Adnan Oomerjee, Fenia Christopoulou, Gerasimos Lampouras, Haitham Bou Ammar, Jun Wang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on the LongBench and ∞-Bench benchmarks demonstrate EM-LLM's superior performance, consistently outperforming the state-of-the-art retrieval model InfLLM across various baseline LLMs. In addition, EM-LLM outperforms its popular counterpart, RAG, in a wide range of tasks, while requiring similar resources. Notably, EM-LLM's performance even surpasses full-context models in most tasks, while successfully performing retrieval across 10 million tokens, a scale computationally infeasible for such models.
Researcher Affiliation | Collaboration | 1Huawei Noah's Ark Lab, London, UK; 2AI Centre, Department of Computer Science, University College London, London, UK
Pseudocode | Yes | Algorithm 1: Event segmentation in KV cache
Input: tok — list of tokens in the sequence
Input: T — surprise threshold for identifying initial boundaries
Input: f — metric function to evaluate potential boundaries
Output: B — list of final boundary positions
1: B ← [i for i in range(length(tok)) if −log P(tok[i]) > T]  ▷ Boundary identification
2: for i in range(length(B) − 1) do
3:   α, β ← B[i], B[i + 1]
4:   B[i + 1] ← arg max_{β̂ ∈ (α, β]} f(A, {α, β̂})  ▷ Boundary refinement
5: end for
6: return B
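The algorithm above can be sketched in plain Python. The attention structure `attn` and the metric `metric` are placeholders (assumptions on my part, since the review only quotes the pseudocode, not the metric definitions from the paper):

```python
def segment_events(log_probs, attn, threshold, metric):
    """Sketch of Algorithm 1: surprise-based event segmentation.

    log_probs: per-token log-probabilities log P(tok[i])
    attn:      placeholder attention structure passed through to `metric`
    threshold: surprise threshold T
    metric:    placeholder boundary-quality function f(attn, alpha, beta_hat)
    """
    # Step 1: boundary identification -- a token opens a new event when its
    # surprise, -log P(tok[i]), exceeds the threshold T.
    boundaries = [i for i, lp in enumerate(log_probs) if -lp > threshold]
    # Step 2: boundary refinement -- move each boundary within (alpha, beta]
    # to the candidate position that maximises the metric f.
    for i in range(len(boundaries) - 1):
        alpha, beta = boundaries[i], boundaries[i + 1]
        boundaries[i + 1] = max(range(alpha + 1, beta + 1),
                                key=lambda b: metric(attn, alpha, b))
    return boundaries
```

Note that the refinement loop runs to `len(boundaries) - 1`, since each step reads the next boundary ahead of the current one.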
Open Source Code Yes Code available at: https://github.com/em-llm/EM-LLM-model
Open Datasets | Yes | Experiments on the LongBench and ∞-Bench benchmarks demonstrate EM-LLM's superior performance... To further prove our hypotheses, we then employ a series of human-annotated podcast scripts... Finally, using the long-context PG-19 dataset (Rae et al., 2020)...
Dataset Splits | No | The paper evaluates performance on the LongBench (Bai et al., 2023) and ∞-Bench (Zhang et al., 2024) benchmarks. While these are established benchmarks, the paper does not explicitly state the specific training, validation, or test splits used for its experiments in the main text.
Hardware Specification | Yes | All of our experiments were run on single nodes of 4 GPUs, each with 32GB of dedicated memory (except for the full-context results, for which we used an API). Additionally, each node had a minimum of 100GB of CPU memory.
Software Dependencies | No | The paper mentions using the Hugging Face Transformers library and Hugging Face Accelerate for implementation, but it does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | Equation 1 introduces the surprise threshold parameter γ... We initially explored our approach's sensitivity to γ using Mistral on the LongBench benchmark. We evaluated the benchmark using surprise-only segmentation with γ ∈ {1.0, 1.5, 2.0, 2.5, 3.0, 3.5}... We chose n = 1 for all experiments... Overall, there was a slight preference for k_r = 0.3, which we therefore selected for the rest of our experiments.
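As a rough illustration of how a γ-parameterised surprise threshold might operate, the following sketch computes a threshold as the mean of recent surprise values plus γ standard deviations. This is an assumption-based illustration only; the exact form of Equation 1 is not quoted in this review:

```python
import statistics

def surprise_threshold(recent_surprises, gamma):
    """Hypothetical gamma-parameterised threshold: mean of recent
    surprise values plus gamma standard deviations. Larger gamma
    means fewer tokens exceed the threshold, hence longer events."""
    mu = statistics.mean(recent_surprises)
    sigma = statistics.pstdev(recent_surprises)
    return mu + gamma * sigma
```

Under this formulation, sweeping γ over {1.0, ..., 3.5} directly trades off segmentation granularity: each increment raises the bar a token's surprise must clear to open a new event boundary.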