(How) Can Transformers Predict Pseudo-Random Numbers?

Authors: Tao Tao, Darshil Doshi, Dayal Singh Kalra, Tianyu He, Maissam Barkeshli

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we study the ability of Transformers to learn pseudo-random number sequences from linear congruential generators (LCGs)... We find that with sufficient architectural capacity and training data variety, Transformers can perform in-context prediction of LCG sequences... By analyzing the embedding layers and attention patterns, we uncover how Transformers develop algorithmic structures... We also perform several systematic scaling analyses to understand the effect of architecture and sequence complexity on model performance and in-context learning ability.
Researcher Affiliation | Academia | 1 Department of Physics, University of Maryland, College Park, USA; 2 Department of Computer Science, University of Maryland, College Park, USA; 3 Joint Quantum Institute, University of Maryland, College Park, USA.
Pseudocode | No | No explicit pseudocode or algorithm blocks are provided. The paper describes algorithms qualitatively in sections like 'Qualitative Algorithm (fixed modulus)' but without structured, code-like formatting.
Open Source Code | Yes | We open source the code to reproduce our results: https://github.com/dayal-kalra/transformer-prng.git
Open Datasets | No | The paper describes in detail how the LCG sequences are generated for training and testing in Section 2.1 "Dataset Generation and Evaluation". While the open-sourced code can reproduce the data generation, the generated datasets themselves are not provided via a direct link, DOI, or dataset repository.
Dataset Splits | Yes | Fixed Modulus (FM): Given a modulus m, we apply the Hull-Dobell Theorem to determine the possible values of (a, c) that maximize the period. We then randomly select 64 values of a and c to generate the test dataset. To generate the training dataset, we exclude these test choices of (a, c) and uniformly sample N = 100,000 LCG sequences of length L + 1 (where L is the context length), with n_a values of multipliers and n_c values of increments. [...] Generalization to Unseen Modulus (UM): In this more challenging paradigm, we first select a set of test moduli Mtest = {mtest} that are reserved exclusively for evaluation. For each test modulus mtest ∈ Mtest, we determine the values of (a, c) that maximize the period. We then randomly select 64 values each of a and c to generate the test dataset. These 64² (a, c) pairs are not considered while generating the training dataset.
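The FM/UM splits above hinge on selecting (a, c) pairs that satisfy the Hull-Dobell Theorem, which characterizes full-period LCGs. A minimal Python sketch of the sequence generator and the full-period check (function names are illustrative, not taken from the paper's released code):

```python
import math

def lcg_sequence(a, c, m, x0, length):
    """Generate an LCG sequence x_{n+1} = (a*x_n + c) mod m, starting at x0."""
    seq = [x0]
    for _ in range(length):
        seq.append((a * seq[-1] + c) % m)
    return seq

def is_full_period(a, c, m):
    """Hull-Dobell Theorem: the LCG has full period m iff
    (1) gcd(c, m) == 1,
    (2) a - 1 is divisible by every prime factor of m, and
    (3) a - 1 is divisible by 4 whenever m is divisible by 4."""
    if math.gcd(c, m) != 1:
        return False
    # Collect the distinct prime factors of m by trial division.
    n, p, prime_factors = m, 2, set()
    while p * p <= n:
        while n % p == 0:
            prime_factors.add(p)
            n //= p
        p += 1
    if n > 1:
        prime_factors.add(n)
    if any((a - 1) % q != 0 for q in prime_factors):
        return False
    if m % 4 == 0 and (a - 1) % 4 != 0:
        return False
    return True
```

For example, (a, c, m) = (5, 3, 8) satisfies all three conditions, so the sequence visits every residue in {0, ..., 7} before repeating.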
Hardware Specification | Yes | In Figure 10, the training of the m = 2^32 model was conducted using four NVIDIA A100 GPUs, requiring a total of 21.82 hours. The m = 2^16 model completed training in 4.83 hours under the same hardware setup. [...] In Figure 11, both models were trained on a single NVIDIA H100 GPU for 22 hours.
Software Dependencies | No | The paper mentions using the AdamW optimizer and cross-entropy loss, but does not provide specific version numbers for software dependencies such as Python, PyTorch, or other libraries.
Experiment Setup | Yes | We train the models with cross-entropy loss using the AdamW optimizer (Loshchilov & Hutter, 2019) with momentum hyperparameters β1 = 0.9 and β2 = 0.99. We implement a linear learning-rate warmup over the first 2048 steps, from an initial learning rate of zero up to the target learning rate η. By default, all experiments employ a batch size of 256. Weight decay is applied only to non-bias parameters. In the unseen modulus case, we observed that both the optimal target learning rate η and weight decay strength λ are highly sensitive to minute changes in training dataset properties... To determine the optimal hyperparameters, we scan the learning rates η ∈ {3e-05, 1e-04, 3e-04, 1e-03} and weight decay strengths λ ∈ {0.01, 0.1, 1.0, 3.0}.
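The warmup schedule and hyperparameter grid described above can be sketched in plain Python (names are illustrative; this is not the paper's training code):

```python
def warmup_lr(step: int, target_lr: float, warmup_steps: int = 2048) -> float:
    """Linear warmup from an initial learning rate of zero to target_lr
    over the first `warmup_steps` optimizer steps, then held constant."""
    return target_lr * min(1.0, step / warmup_steps)

# Hyperparameter grid scanned in the unseen-modulus experiments:
# target learning rates η and weight decay strengths λ.
learning_rates = [3e-05, 1e-04, 3e-04, 1e-03]
weight_decays = [0.01, 0.1, 1.0, 3.0]
grid = [(lr, wd) for lr in learning_rates for wd in weight_decays]
```

With η = 3e-04, for instance, the schedule yields lr = 0 at step 0, lr = 1.5e-04 at step 1024, and lr = 3e-04 from step 2048 onward; the grid covers all 16 (η, λ) combinations.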