Human-inspired Episodic Memory for Infinite Context LLMs

Authors: Zafeirios Fountas, Martin A Benfeghoul, Adnan Oomerjee, Fenia Christopoulou, Gerasimos Lampouras, Haitham Bou Ammar, Jun Wang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on the LongBench and ∞-Bench benchmarks demonstrate EM-LLM's superior performance, consistently outperforming the state-of-the-art retrieval model InfLLM across various baseline LLMs. In addition, EM-LLM outperforms its popular counterpart, RAG, in a wide range of tasks, while requiring similar resources. Notably, EM-LLM's performance even surpasses full-context models in most tasks, while successfully performing retrieval across 10 million tokens, a scale computationally infeasible for such models.
Researcher Affiliation | Collaboration | 1Huawei Noah's Ark Lab, London, UK; 2AI Centre, Department of Computer Science, University College London, London, UK
Pseudocode | Yes | Algorithm 1: Event segmentation in KV cache
Input: tok — list of tokens in the sequence
Input: T — surprise threshold for identifying initial boundaries
Input: f — metric function to evaluate potential boundaries
Output: B — list of final boundary positions
1: B ← [i for i in range(length(tok)) if −log P(tok[i]) > T]  ▷ Boundary identification
2: for i in range(length(B) − 1) do
3:   α, β ← B[i], B[i + 1]
4:   B[i + 1] ← arg max_{β̂ ∈ (α, β]} f(A, {α, β̂})  ▷ Boundary refinement
5: end for
6: return B
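The algorithm above can be sketched in plain Python. The attention structure `attn` and the metric `metric` are placeholders (assumptions on my part, since the review only quotes the pseudocode, not the metric definitions from the paper):

```python
def segment_events(log_probs, attn, threshold, metric):
    """Sketch of Algorithm 1: surprise-based event segmentation.

    log_probs: per-token log-probabilities log P(tok[i])
    attn:      placeholder attention structure passed through to `metric`
    threshold: surprise threshold T
    metric:    placeholder boundary-quality function f(attn, alpha, beta_hat)
    """
    # Step 1: boundary identification -- a token opens a new event when its
    # surprise, -log P(tok[i]), exceeds the threshold T.
    boundaries = [i for i, lp in enumerate(log_probs) if -lp > threshold]
    # Step 2: boundary refinement -- move each boundary within (alpha, beta]
    # to the candidate position that maximises the metric f.
    for i in range(len(boundaries) - 1):
        alpha, beta = boundaries[i], boundaries[i + 1]
        boundaries[i + 1] = max(range(alpha + 1, beta + 1),
                                key=lambda b: metric(attn, alpha, b))
    return boundaries
```

Note that the refinement loop runs to `len(boundaries) - 1`, since each step reads the next boundary ahead of the current one.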
Open Source Code Yes Code available at: https://github.com/em-llm/EM-LLM-model
Open Datasets | Yes | Experiments on the LongBench and ∞-Bench benchmarks demonstrate EM-LLM's superior performance... To further prove our hypotheses, we then employ a series of human-annotated podcast scripts... Finally, using the long-context PG-19 dataset (Rae et al., 2020)...
Dataset Splits | No | The paper evaluates performance on the LongBench (Bai et al., 2023) and ∞-Bench (Zhang et al., 2024) benchmarks. While these are established benchmarks, the paper does not explicitly state the specific training, validation, or test splits used for its experiments in the main text.
Hardware Specification | Yes | All of our experiments were run on single nodes of 4 GPUs, each with 32GB of dedicated memory (except for the full-context results, for which we used an API). Additionally, each node had a minimum of 100GB of CPU memory.
Software Dependencies | No | The paper mentions using the Hugging Face Transformers library and Hugging Face Accelerate for implementation, but it does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | Equation 1 introduces the surprise threshold parameter γ... We initially explored our approach's sensitivity to γ using Mistral on the LongBench benchmark. We evaluated the benchmark using surprise-only segmentation with γ ∈ {1.0, 1.5, 2.0, 2.5, 3.0, 3.5}... We chose n = 1 for all experiments... Overall, there was a slight preference for k_r = 0.3, which we therefore selected for the rest of our experiments.
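As a rough illustration of how a γ-parameterised surprise threshold might operate, the following sketch computes a threshold as the mean of recent surprise values plus γ standard deviations. This is an assumption-based illustration only; the exact form of Equation 1 is not quoted in this review:

```python
import statistics

def surprise_threshold(recent_surprises, gamma):
    """Hypothetical gamma-parameterised threshold: mean of recent
    surprise values plus gamma standard deviations. Larger gamma
    means fewer tokens exceed the threshold, hence longer events."""
    mu = statistics.mean(recent_surprises)
    sigma = statistics.pstdev(recent_surprises)
    return mu + gamma * sigma
```

Under this formulation, sweeping γ over {1.0, ..., 3.5} directly trades off segmentation granularity: each increment raises the bar a token's surprise must clear to open a new event boundary.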