MELODI: Exploring Memory Compression for Long Contexts

Authors: Yinpeng Chen, DeLesley Hutchins, Aren Jansen, Andrey Zhmoginov, David Racz, Jesper Andersen

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | MELODI demonstrates strong performance on several long-context datasets. For instance, using a 13-layer transformer with an embedding dimension of 1024 and 512-token context windows, MELODI achieves perplexities of 10.44 on PG-19 (T5 vocabulary) and 2.11 on arXiv Math (Meena vocabulary). This improves on the Memorizing Transformer with dense attention (as opposed to top-k attention), which scores 10.62 on PG-19 and 2.14 on arXiv Math, while reducing memory usage by a factor of 8. Ablation studies further confirm that MELODI's short-term and long-term memories are complementary, contributing synergistically to an efficient and effective memory architecture.
Researcher Affiliation | Industry | Yinpeng Chen, DeLesley Hutchins, Aren Jansen, Andrey Zhmoginov, David Racz, Jesper Andersen (Google DeepMind). EMAIL
Pseudocode | No | The paper describes the MELODI architecture and its components (short-term and long-term memory) using prose, figures, and mathematical notation (e.g., Equation 1). However, it does not present any explicit pseudocode blocks or algorithms with structured, code-like formatting.
Open Source Code | No | The paper states: "All models were implemented in JAX and Flax". However, it does not provide an explicit statement about releasing the source code for MELODI, nor does it include a link to a code repository.
Open Datasets | Yes | PG19: The PG19 dataset (Rae et al., 2019) consists of 28,752 English books published before 1919... arXiv Math: The arXiv dataset (Wu et al., 2022) comprises technical math papers from arXiv... C4 (4K+): The C4 dataset (Raffel et al., 2020a) is a large collection of internet documents.
Dataset Splits | No | The paper states: "We report the average perplexity on the respective test sets as our evaluation metric." and "During training, each long document was segmented into 4096-token chunks to facilitate batch processing. These chunks were then organized into training batches, each comprising 8 context windows of 512 tokens." However, it does not explicitly provide details about the train/validation/test splits, such as percentages, sample counts, or a citation to a standard split methodology.
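The segmentation the paper does describe (4096-token chunks, each split into 8 consecutive 512-token context windows) can be sketched as follows. The function name is hypothetical, and dropping the trailing partial chunk is an assumption, since the paper does not specify padding or remainder handling:

```python
def make_window_batches(token_ids, chunk_len=4096, window_len=512):
    """Segment a long document into fixed-size chunks, then split each
    chunk into consecutive context windows (8 windows of 512 tokens
    per 4096-token chunk, per the paper's description).

    Assumption: a trailing partial chunk is dropped; the paper does
    not state how remainders are handled.
    """
    n_chunks = len(token_ids) // chunk_len
    windows_per_chunk = chunk_len // window_len  # 8 for the defaults
    batches = []
    for c in range(n_chunks):
        chunk = token_ids[c * chunk_len:(c + 1) * chunk_len]
        batches.append([chunk[w * window_len:(w + 1) * window_len]
                        for w in range(windows_per_chunk)])
    return batches
```

For an 8192-token document this yields two chunks of 8 windows each, matching the paper's "8 context windows of 512 tokens" per training batch entry.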
Hardware Specification | Yes | All models were implemented in JAX and Flax and trained from scratch for 500k steps on 32 TPU cores. ... We measured the training and test times of 12-layer networks with an embedding dimension of 1024 on TPU v6e.
Software Dependencies | No | The paper mentions: "All models were implemented in JAX and Flax". However, it does not specify version numbers for these software components or any other libraries used.
Experiment Setup | Yes | We use Adafactor optimizer (Shazeer & Stern, 2018) with a learning rate schedule that employs a linear warmup for the first 1000 steps, followed by cosine decay. The maximum and minimum learning rates are set to 0.01 and 0.001, respectively, as recommended in Hoffmann et al. (2022). A dropout rate of 0.05 is applied. All models are trained for 500k steps (200k for ablations) on 32 TPU cores with a batch size of 32 (1 per core).
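The learning rate schedule described above (linear warmup for 1000 steps, cosine decay, max 0.01, min 0.001) can be sketched as a plain function. The decay horizon of 500k steps and the behavior past it are assumptions; the paper states only the warmup length, the schedule shape, and the two rate bounds:

```python
import math

def melodi_lr_schedule(step, warmup_steps=1000, total_steps=500_000,
                       peak_lr=0.01, min_lr=0.001):
    """Linear warmup to peak_lr over warmup_steps, then cosine decay
    from peak_lr down to min_lr by total_steps.

    Assumptions: the decay horizon equals the full 500k training steps,
    and the rate is clamped at min_lr beyond that horizon; the paper
    does not specify either detail.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    progress = min(progress, 1.0)  # clamp past the training horizon
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

In a JAX/Flax setup like the paper's, optax's `warmup_cosine_decay_schedule` provides the same shape as a ready-made schedule that can be passed to `optax.adafactor`.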