MELODI: Exploring Memory Compression for Long Contexts

Authors: Yinpeng Chen, DeLesley Hutchins, Aren Jansen, Andrey Zhmoginov, David Racz, Jesper Andersen

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | MELODI demonstrates strong performance on several long-context datasets. For instance, using a 13-layer transformer with an embedding dimension of 1024 and 512-token context windows, MELODI achieves perplexities of 10.44 on PG-19 (T5 vocabulary) and 2.11 on arXiv Math (Meena vocabulary). This improves on the Memorizing Transformer with dense attention (as opposed to top-k attention), which scores 10.62 on PG-19 and 2.14 on arXiv Math, while reducing memory usage by a factor of 8. Ablation studies further confirm that MELODI's short-term and long-term memories are complementary, contributing synergistically to an efficient and effective memory architecture.
Researcher Affiliation | Industry | Yinpeng Chen, DeLesley Hutchins, Aren Jansen, Andrey Zhmoginov, David Racz, Jesper Andersen (Google DeepMind). EMAIL
Pseudocode | No | The paper describes the MELODI architecture and its components (short-term and long-term memory) using prose, figures, and mathematical notation (e.g., Equation 1). However, it does not present any explicit pseudocode blocks or algorithms with structured, code-like formatting.
Open Source Code | No | The paper states: "All models were implemented in JAX and Flax". However, it does not provide an explicit statement about releasing the source code for MELODI, nor does it include a link to a code repository.
Open Datasets | Yes | PG19: The PG19 dataset (Rae et al., 2019) consists of 28,752 English books published before 1919... arXiv Math: The arXiv dataset (Wu et al., 2022) comprises technical math papers from arXiv... C4 (4K+): The C4 dataset (Raffel et al., 2020a) is a large collection of internet documents.
Dataset Splits | No | The paper states: "We report the average perplexity on the respective test sets as our evaluation metric." and "During training, each long document was segmented into 4096-token chunks to facilitate batch processing. These chunks were then organized into training batches, each comprising 8 context windows of 512 tokens." However, it does not explicitly provide details about the train/validation/test splits, such as percentages, sample counts, or a citation to a standard split methodology.
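The segmentation the paper does describe (4096-token chunks, each split into 8 consecutive 512-token context windows) can be sketched as follows. The function name is hypothetical, and dropping the trailing partial chunk is an assumption, since the paper does not specify padding or remainder handling:

```python
def make_window_batches(token_ids, chunk_len=4096, window_len=512):
    """Segment a long document into fixed-size chunks, then split each
    chunk into consecutive context windows (8 windows of 512 tokens
    per 4096-token chunk, per the paper's description).

    Assumption: a trailing partial chunk is dropped; the paper does
    not state how remainders are handled.
    """
    n_chunks = len(token_ids) // chunk_len
    windows_per_chunk = chunk_len // window_len  # 8 for the defaults
    batches = []
    for c in range(n_chunks):
        chunk = token_ids[c * chunk_len:(c + 1) * chunk_len]
        batches.append([chunk[w * window_len:(w + 1) * window_len]
                        for w in range(windows_per_chunk)])
    return batches
```

For an 8192-token document this yields two chunks of 8 windows each, matching the paper's "8 context windows of 512 tokens" per training batch entry.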
Hardware Specification | Yes | All models were implemented in JAX and Flax and trained from scratch for 500k steps on 32 TPU cores. ... We measured the training and test times of 12-layer networks with an embedding dimension of 1024 on TPU v6e.
Software Dependencies | No | The paper mentions: "All models were implemented in JAX and Flax". However, it does not specify version numbers for these software components or any other libraries used.
Experiment Setup | Yes | We use Adafactor optimizer (Shazeer & Stern, 2018) with a learning rate schedule that employs a linear warmup for the first 1000 steps, followed by cosine decay. The maximum and minimum learning rates are set to 0.01 and 0.001, respectively, as recommended in Hoffmann et al. (2022). A dropout rate of 0.05 is applied. All models are trained for 500k steps (200k for ablations) on 32 TPU cores with a batch size of 32 (1 per core).
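The learning rate schedule described above (linear warmup for 1000 steps, cosine decay, max 0.01, min 0.001) can be sketched as a plain function. The decay horizon of 500k steps and the behavior past it are assumptions; the paper states only the warmup length, the schedule shape, and the two rate bounds:

```python
import math

def melodi_lr_schedule(step, warmup_steps=1000, total_steps=500_000,
                       peak_lr=0.01, min_lr=0.001):
    """Linear warmup to peak_lr over warmup_steps, then cosine decay
    from peak_lr down to min_lr by total_steps.

    Assumptions: the decay horizon equals the full 500k training steps,
    and the rate is clamped at min_lr beyond that horizon; the paper
    does not specify either detail.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    progress = min(progress, 1.0)  # clamp past the training horizon
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

In a JAX/Flax setup like the paper's, optax's `warmup_cosine_decay_schedule` provides the same shape as a ready-made schedule that can be passed to `optax.adafactor`.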