On the Optimal Memorization Capacity of Transformers
Authors: Tokio Kajitsuka, Issei Sato
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we empirically investigate whether the memorization capacity of real-world Transformers aligns with the behavior predicted by our theoretical analysis when varying the size of the dataset and the length of input sequences. We trained Transformers in the next-token prediction setting on two real-world datasets and one randomly generated dataset of various sizes and evaluated the minimum network size required to memorize each dataset, plotting the results to examine the correlation between dataset size and network size. |
| Researcher Affiliation | Academia | Tokio Kajitsuka & Issei Sato, Department of Computer Science, The University of Tokyo |
| Pseudocode | No | The paper primarily presents theoretical analysis, theorems, lemmas, and proofs using mathematical formulations. It does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps formatted like code for procedures. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing the source code for the methodology described, nor does it provide a link to a code repository. It mentions third-party tools like Optuna and AdamW, but not its own implementation code for the research. |
| Open Datasets | Yes | We trained Transformers in the next-token prediction setting on two real-world datasets: Multi NLI dataset (Williams et al., 2018) from GLUE benchmark (Wang et al., 2018) and IMDb dataset (Maas et al., 2011). |
| Dataset Splits | No | The paper mentions training on sampled datasets of specific sizes (e.g., 'datasets sampled from the Multi NLI dataset, where the sequence length was fixed at n = 8 and the dataset size N ranged from 600 to 1700 in increments of 100'). While it refers to 'training loss' and 'training error', it does not provide explicit details about how the datasets were split into training, validation, or test sets for reproducibility of the splitting methodology. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments, such as specific GPU or CPU models, memory amounts, or detailed computer specifications. It only discusses model configurations and dataset characteristics in the experimental setup. |
| Software Dependencies | No | The paper mentions the AdamW optimizer (Loshchilov & Hutter, 2019) and Optuna (Akiba et al., 2019) for hyperparameter tuning, but it does not specify software versions or provide a complete list of dependencies needed to reproduce the environment. |
| Experiment Setup | Yes | The model was trained using the AdamW optimizer (Loshchilov & Hutter, 2019) with full-batch updates. To focus on the representational capacity of models and minimize the influence of optimization, hyperparameters such as the learning rate and warmup interval were tuned using Optuna (Akiba et al., 2019). Each model was trained with full-batch gradient descent for 1000 epochs, and the best-performing model was selected after two trials of Optuna hyperparameter tuning. |
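
The training setup described in the Experiment Setup row (full-batch AdamW updates for 1000 epochs on a memorization task) can be sketched as follows. This is an illustrative toy, not the paper's implementation: the Transformer is replaced by an overparameterized linear model fitting random labels, and the learning-rate and weight-decay values are assumptions (the paper tunes such hyperparameters with Optuna).

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8, wd=1e-2):
    """One AdamW update with decoupled weight decay (Loshchilov & Hutter, 2019)."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)           # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)           # bias-corrected second moment
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)
    return w, m, v

def train_full_batch(X, y, epochs=1000, lr=1e-2):
    """Full-batch gradient descent on MSE; a linear stand-in for the Transformer."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=X.shape[1])
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, epochs + 1):
        residual = X @ w - y            # predictions vs. targets on the whole dataset
        g = X.T @ residual / len(y)     # full-batch MSE gradient
        w, m, v = adamw_step(w, g, m, v, t, lr=lr)
    return w, float(np.mean((X @ w - y) ** 2))

# Overparameterized memorization: 16 random labels, 64 parameters.
rng = np.random.default_rng(1)
X = rng.normal(size=(16, 64))
y = rng.normal(size=16)                 # random labels the model must memorize
w, final_loss = train_full_batch(X, y)
```

In the paper's experiments, the analogue of `final_loss` (the training error after 1000 epochs) is used to decide whether a given network size suffices to memorize a dataset of size N.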