On the Optimal Memorization Capacity of Transformers

Authors: Tokio Kajitsuka, Issei Sato

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we empirically investigate whether the memorization capacity of real-world Transformers aligns with the behavior predicted by our theoretical analysis when varying the size of the dataset and the length of input sequences. We trained Transformers in the next-token prediction setting on two real-world datasets and one randomly generated dataset of various sizes, evaluated the minimum network size required to memorize each dataset, and plotted the results to examine the correlation between dataset size and network size.
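The evaluation loop described above (finding the minimal network size that memorizes each dataset) can be sketched as a binary search over a monotone memorization test. The `memorizes` predicate and its linear capacity rule below are illustrative assumptions, not the paper's model:

```python
def memorizes(width: int, n_samples: int) -> bool:
    """Stand-in memorization test: assumes capacity grows linearly
    with network width (8 samples per unit width, purely illustrative)."""
    return width * 8 >= n_samples

def minimal_width(n_samples: int, max_width: int = 512) -> int:
    """Binary search for the smallest width that memorizes the dataset;
    valid because memorization is monotone in width."""
    lo, hi = 1, max_width
    while lo < hi:
        mid = (lo + hi) // 2
        if memorizes(mid, n_samples):
            hi = mid          # mid suffices; try smaller
        else:
            lo = mid + 1      # mid fails; need more capacity
    return lo

# Dataset sizes from the paper's Multi NLI experiment: N = 600..1700, step 100.
sizes = list(range(600, 1701, 100))
widths = [minimal_width(n) for n in sizes]
print(list(zip(sizes, widths)))
```

In the actual experiments each memorization check is a full training run rather than a closed-form test, but the search structure over network sizes is the same.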
Researcher Affiliation | Academia | Tokio Kajitsuka & Issei Sato, Department of Computer Science, The University of Tokyo, EMAIL
Pseudocode | No | The paper primarily presents theoretical analysis, theorems, lemmas, and proofs using mathematical formulations. It does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor any procedural steps formatted as code.
Open Source Code | No | The paper does not contain any explicit statements about releasing the source code for the methodology described, nor does it provide a link to a code repository. It mentions third-party tools like Optuna and AdamW, but not its own implementation code for the research.
Open Datasets | Yes | We trained Transformers in the next-token prediction setting on two real-world datasets: the Multi NLI dataset (Williams et al., 2018) from the GLUE benchmark (Wang et al., 2018) and the IMDb dataset (Maas et al., 2011).
Dataset Splits | No | The paper mentions training on sampled datasets of specific sizes (e.g., 'datasets sampled from the Multi NLI dataset, where the sequence length was fixed at n = 8 and the dataset size N ranged from 600 to 1700 in increments of 100'). While it refers to 'training loss' and 'training error', it does not explicitly describe how the datasets were split into training, validation, or test sets, so the splitting methodology cannot be reproduced.
Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments, such as specific GPU or CPU models, memory amounts, or detailed computer specifications. It only discusses model configurations and dataset characteristics in the experimental setup.
Software Dependencies | No | The paper names the AdamW optimizer (Loshchilov & Hutter, 2019) and Optuna (Akiba et al., 2019) for hyperparameter tuning, but it does not specify version numbers or provide a complete list of software dependencies needed to reproduce the environment.
Experiment Setup | Yes | The model was trained using the AdamW optimizer (Loshchilov & Hutter, 2019) with full-batch updates. To focus on the representational capacity of models and minimize the influence of optimization, we tuned hyperparameters such as the learning rate and warmup interval using Optuna (Akiba et al., 2019). Each model was trained using full-batch gradient descent for 1000 epochs, and the best-performing model was selected after running two trials of hyperparameter tuning with Optuna.
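A minimal, dependency-free sketch of this tuning procedure: full-batch gradient descent with linear learning-rate warmup on a toy quadratic objective, with random search over two trials standing in for Optuna's sampler. The objective, search ranges, and `train` function are illustrative assumptions, not the paper's actual model or code:

```python
import random

def train(lr: float, warmup: int, epochs: int = 1000) -> float:
    """Full-batch gradient descent on f(w) = (w - 3)^2 with linear warmup.
    Returns the final training loss."""
    w = 0.0
    for t in range(1, epochs + 1):
        step = lr * min(1.0, t / max(1, warmup))  # linear warmup schedule
        grad = 2.0 * (w - 3.0)                    # full-batch gradient
        w -= step * grad
    return (w - 3.0) ** 2

def tune(trials: int = 2, seed: int = 0):
    """Random search over (lr, warmup), keeping the best trial,
    mirroring the two-trial Optuna tuning described above."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        lr = 10 ** rng.uniform(-3, -0.5)   # log-uniform, as Optuna would sample
        warmup = rng.randint(1, 100)
        loss = train(lr, warmup)
        if best is None or loss < best[0]:
            best = (loss, lr, warmup)
    return best

loss, lr, warmup = tune()
print(f"best loss {loss:.2e} with lr={lr:.4f}, warmup={warmup}")
```

With Optuna itself, `train` would be wrapped in an objective calling `trial.suggest_float` for the learning rate and `study.optimize(objective, n_trials=2)` in place of the loop above.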