Quantifying Memory Utilization with Effective State-Size
Authors: Rom Parnichkun, Neehal Tumma, Armin W Thomas, Alessandro Moro, Qi An, Taiji Suzuki, Atsushi Yamashita, Michael Poli, Stefano Massaroli
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate ESS beyond its theoretical interpretation by demonstrating its correlation with performance across a wide range of models and memory-intensive synthetic tasks, including associative recall and selective copying (Section 4). We explore the use of the ESS metric as a means of enhancing the performance-efficiency trade-off by demonstrating its ability to inform model distillation (Section 5.2), initialization schemes (Section 5.1), and regularization strategies (Section A). |
| Researcher Affiliation | Collaboration | 1. Liquid AI, 2. The University of Tokyo, 3. RIKEN, 4. Stanford University. |
| Pseudocode | Yes | D.1.3. PyTorch Implementation. Below, we provide a PyTorch implementation of various ESS metrics and helper functions that were leveraged in our analyses. |
| Open Source Code | Yes | An additional discussion comparing tolerance and entropy ESS, along with our code for computing them, can be found in Section D.1. |
| Open Datasets | Yes | To explore ESS in an extensive, yet controlled, manner, we iterate on a set of synthetic tasks proposed by Poli et al. which have been shown to effectively approximate model performance on large-scale language tasks. Specifically, we train models on the multi-query associative recall (MQAR), selective copying, and compression tasks. ... We observe a clear hierarchy in the degree of state modulation, which can be summarized as follows: SA > GLA > WLA > LA (Figure 42). The perplexity scores shown in Figure 9b were computed on 16k randomly sampled sequences over the FineWeb (Penedo et al., 2024) dataset. ... The task we tested this on is MMLU (elementary mathematics). |
| Dataset Splits | Yes | For each task-model configuration, we compute the ESS and accuracy on a validation set every 10 epochs. ... Num. Training Samples: 128k; Num. Testing Samples: 6.4k (Table 1) |
| Hardware Specification | No | The paper mentions "advancements in hardware accelerators such as GPUs" in the introduction but does not specify any particular models or configurations used for their experiments. |
| Software Dependencies | No | Section D.1.3 provides PyTorch code using `import torch`. However, no specific version numbers for PyTorch or any other software libraries are mentioned in the paper. |
| Experiment Setup | Yes | Table 1. Set of hyperparameters for the task-model sweep. ... Table 4. Default MQAR task settings employed throughout the featurizer and initialization experiments in Section 5.1. ... Table 9. 1B LLM settings. |
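The paper's Section D.1.3 provides the authors' PyTorch implementation of the ESS metrics, which is not reproduced here. As a rough illustration of the two variants named in Section D.1 (tolerance ESS and entropy ESS), the following is a minimal NumPy sketch under the common assumption that tolerance ESS counts singular values above a relative threshold and entropy ESS is the exponential of the entropy of the normalized singular-value distribution; the exact definitions and the matrices they are applied to in the paper may differ.

```python
import numpy as np

def tolerance_ess(matrix: np.ndarray, tol: float = 1e-3) -> int:
    """Count singular values above a relative tolerance (assumed definition)."""
    s = np.linalg.svd(matrix, compute_uv=False)
    if s.size == 0 or s[0] == 0.0:
        return 0
    # Normalize by the largest singular value, then threshold.
    return int(np.sum(s / s[0] > tol))

def entropy_ess(matrix: np.ndarray, eps: float = 1e-12) -> float:
    """Effective rank via the entropy of the normalized singular values
    (assumed definition; a smooth alternative to hard thresholding)."""
    s = np.linalg.svd(matrix, compute_uv=False)
    p = s / (s.sum() + eps)
    entropy = -np.sum(p * np.log(p + eps))
    return float(np.exp(entropy))
```

For a full-rank identity matrix both measures recover the exact rank, while for a rank-deficient matrix the tolerance variant gives a hard count and the entropy variant a soft, continuous value.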