On the Optimal Memorization Capacity of Transformers

Authors: Tokio Kajitsuka, Issei Sato

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we empirically investigate whether the memorization capacity of real-world Transformers aligns with the behavior predicted by our theoretical analysis when varying the size of the dataset and the length of input sequences. We trained Transformers in the next-token prediction setting on two real-world datasets and one randomly generated dataset of various sizes, evaluated the minimum network size required to memorize each dataset, and plotted the results to examine the correlation between dataset size and network size.
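The evaluation loop described above (finding the minimal network size that memorizes each dataset) can be sketched as a binary search over a monotone memorization test. The `memorizes` predicate and its linear capacity rule below are illustrative assumptions, not the paper's model:

```python
def memorizes(width: int, n_samples: int) -> bool:
    """Stand-in memorization test: assumes capacity grows linearly
    with network width (8 samples per unit width, purely illustrative)."""
    return width * 8 >= n_samples

def minimal_width(n_samples: int, max_width: int = 512) -> int:
    """Binary search for the smallest width that memorizes the dataset;
    valid because memorization is monotone in width."""
    lo, hi = 1, max_width
    while lo < hi:
        mid = (lo + hi) // 2
        if memorizes(mid, n_samples):
            hi = mid          # mid suffices; try smaller
        else:
            lo = mid + 1      # mid fails; need more capacity
    return lo

# Dataset sizes from the paper's Multi NLI experiment: N = 600..1700, step 100.
sizes = list(range(600, 1701, 100))
widths = [minimal_width(n) for n in sizes]
print(list(zip(sizes, widths)))
```

In the actual experiments each memorization check is a full training run rather than a closed-form test, but the search structure over network sizes is the same.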
Researcher Affiliation | Academia | Tokio Kajitsuka & Issei Sato, Department of Computer Science, The University of Tokyo, EMAIL
Pseudocode | No | The paper primarily presents theoretical analysis, theorems, lemmas, and proofs using mathematical formulations. It does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor any procedural steps formatted as code.
Open Source Code | No | The paper does not contain any explicit statements about releasing the source code for the methodology described, nor does it provide a link to a code repository. It mentions third-party tools like Optuna and AdamW, but not its own implementation code for the research.
Open Datasets | Yes | We trained Transformers in the next-token prediction setting on two real-world datasets: the Multi NLI dataset (Williams et al., 2018) from the GLUE benchmark (Wang et al., 2018) and the IMDb dataset (Maas et al., 2011).
Dataset Splits | No | The paper mentions training on sampled datasets of specific sizes (e.g., 'datasets sampled from the Multi NLI dataset, where the sequence length was fixed at n = 8 and the dataset size N ranged from 600 to 1700 in increments of 100'). While it refers to 'training loss' and 'training error', it does not explicitly describe how the datasets were split into training, validation, or test sets, so the splitting methodology cannot be reproduced.
Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments, such as specific GPU or CPU models, memory amounts, or detailed computer specifications. It only discusses model configurations and dataset characteristics in the experimental setup.
Software Dependencies | No | The paper names the AdamW optimizer (Loshchilov & Hutter, 2019) and Optuna (Akiba et al., 2019) for hyperparameter tuning, but it does not specify version numbers or provide a complete list of software dependencies needed to reproduce the environment.
Experiment Setup | Yes | The model was trained using the AdamW optimizer (Loshchilov & Hutter, 2019) with full-batch updates. To focus on the representational capacity of models and minimize the influence of optimization, we tuned hyperparameters such as the learning rate and warmup interval using Optuna (Akiba et al., 2019). Each model was trained using full-batch gradient descent for 1000 epochs, and the best-performing model was selected after running two trials of hyperparameter tuning with Optuna.
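A minimal, dependency-free sketch of this tuning procedure: full-batch gradient descent with linear learning-rate warmup on a toy quadratic objective, with random search over two trials standing in for Optuna's sampler. The objective, search ranges, and `train` function are illustrative assumptions, not the paper's actual model or code:

```python
import random

def train(lr: float, warmup: int, epochs: int = 1000) -> float:
    """Full-batch gradient descent on f(w) = (w - 3)^2 with linear warmup.
    Returns the final training loss."""
    w = 0.0
    for t in range(1, epochs + 1):
        step = lr * min(1.0, t / max(1, warmup))  # linear warmup schedule
        grad = 2.0 * (w - 3.0)                    # full-batch gradient
        w -= step * grad
    return (w - 3.0) ** 2

def tune(trials: int = 2, seed: int = 0):
    """Random search over (lr, warmup), keeping the best trial,
    mirroring the two-trial Optuna tuning described above."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        lr = 10 ** rng.uniform(-3, -0.5)   # log-uniform, as Optuna would sample
        warmup = rng.randint(1, 100)
        loss = train(lr, warmup)
        if best is None or loss < best[0]:
            best = (loss, lr, warmup)
    return best

loss, lr, warmup = tune()
print(f"best loss {loss:.2e} with lr={lr:.4f}, warmup={warmup}")
```

With Optuna itself, `train` would be wrapped in an objective calling `trial.suggest_float` for the learning rate and `study.optimize(objective, n_trials=2)` in place of the loop above.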