Quantifying Memory Utilization with Effective State-Size
Authors: Rom Parnichkun, Neehal Tumma, Armin W Thomas, Alessandro Moro, Qi An, Taiji Suzuki, Atsushi Yamashita, Michael Poli, Stefano Massaroli
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate ESS beyond its theoretical interpretation by demonstrating its correlation with performance across a wide range of models and memory-intensive synthetic tasks, including associative recall and selective copying (Section 4). We explore the use of the ESS metric as a means of enhancing the performance-efficiency trade-off by demonstrating its ability to inform model distillation (Section 5.2), initialization schemes (Section 5.1), and regularization strategies (Section A). |
| Researcher Affiliation | Collaboration | 1. Liquid AI, 2. The University of Tokyo, 3. RIKEN, 4. Stanford University. |
| Pseudocode | Yes | D.1.3. PyTorch Implementation. Below, we provide a PyTorch implementation of various ESS metrics and helper functions that were leveraged in our analyses. |
| Open Source Code | Yes | An additional discussion comparing tolerance and entropy ESS, along with our code for computing them, can be found in Section D.1. |
| Open Datasets | Yes | To explore ESS in an extensive, yet controlled, manner, we iterate on a set of synthetic tasks proposed by Poli et al. which have been shown to effectively approximate model performance on large-scale language tasks. Specifically, we train models on the multi-query associative recall (MQAR), selective copying, and compression tasks. ... We observe a clear hierarchy in the degree of state modulation, which can be summarized as follows: SA > GLA > WLA > LA (Figure 42). The perplexity scores shown in Figure 9b were computed on 16k randomly sampled sequences over the FineWeb (Penedo et al., 2024) dataset. ... The task we tested this on is MMLU (elementary mathematics). |
| Dataset Splits | Yes | For each task-model configuration, we compute the ESS and accuracy on a validation set every 10 epochs. ... Num. Training Samples: 128k; Num. Testing Samples: 6.4k (Table 1) |
| Hardware Specification | No | The paper mentions "advancements in hardware accelerators such as GPUs" in the introduction but does not specify any particular models or configurations used for their experiments. |
| Software Dependencies | No | Section D.1.3 provides PyTorch code using `import torch`. However, no specific version numbers for PyTorch or any other software libraries are mentioned in the paper. |
| Experiment Setup | Yes | Table 1. Set of hyperparameters for the task-model sweep. ... Table 4. Default MQAR task settings employed throughout the featurizer and initialization experiments in Section 5.1. ... Table 9. 1B LLM settings. |
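The paper's Section D.1.3 provides the authors' PyTorch implementation of the ESS metrics, which is not reproduced here. As a rough illustration of the two variants named in Section D.1 (tolerance ESS and entropy ESS), the following is a minimal NumPy sketch under the common assumption that tolerance ESS counts singular values above a relative threshold and entropy ESS is the exponential of the entropy of the normalized singular-value distribution; the exact definitions and the matrices they are applied to in the paper may differ.

```python
import numpy as np

def tolerance_ess(matrix: np.ndarray, tol: float = 1e-3) -> int:
    """Count singular values above a relative tolerance (assumed definition)."""
    s = np.linalg.svd(matrix, compute_uv=False)
    if s.size == 0 or s[0] == 0.0:
        return 0
    # Normalize by the largest singular value, then threshold.
    return int(np.sum(s / s[0] > tol))

def entropy_ess(matrix: np.ndarray, eps: float = 1e-12) -> float:
    """Effective rank via the entropy of the normalized singular values
    (assumed definition; a smooth alternative to hard thresholding)."""
    s = np.linalg.svd(matrix, compute_uv=False)
    p = s / (s.sum() + eps)
    entropy = -np.sum(p * np.log(p + eps))
    return float(np.exp(entropy))
```

For a full-rank identity matrix both measures recover the exact rank, while for a rank-deficient matrix the tolerance variant gives a hard count and the entropy variant a soft, continuous value.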