Universal Approximation of Mean-Field Models via Transformers

Authors: Shiba Biswal, Karthik Elamvazhuthi, Rishi Sonthalia

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | First, we empirically demonstrate that transformers are well-suited for approximating a variety of mean-field models, including the Cucker-Smale model for flocking and milling, and the mean-field system for training two-layer neural networks. We validate our numerical experiments via mathematical theory. Specifically, we prove that if a finite-dimensional transformer effectively approximates the finite-dimensional vector field governing the particle system, then the L2 distance between the expected transformer and the infinite-dimensional mean-field vector field can be uniformly bounded by a function of the number of particles observed during training.
Researcher Affiliation | Academia | (1) Theoretical Division, Los Alamos National Laboratory, Los Alamos, NM, USA; (2) Boston College, Boston, MA, USA. Correspondence to: Shiba Biswal <EMAIL>, Rishi Sonthalia <EMAIL>.
Pseudocode | No | The paper includes definitions for Multi-Headed Self-Attention and Transformer Network in Appendix D, but these are descriptive definitions rather than structured pseudocode or algorithm blocks for the overall methodology presented in the main paper.
Open Source Code | Yes | Code can be found at: https://github.com/rsonthal/Mean-Field-Transformers
Open Datasets | Yes | Our first goal focuses on learning the vector field F. Towards this, in this experiment, we use two datasets: first, a synthetic dataset generated from the Cucker-Smale model (Cucker & Smale, 2007), and second, real data of fish milling (Katz et al., 2021). ... Katz, Y., Tunstrøm, K., Ioannou, C. C., Huepe, C., and Couzin, I. D. Fish schooling data subset: Oregon State University. https://ir.library.oregonstate.edu/concern/datasets/zk51vq07c, 2021.
Dataset Splits | Yes | The data so obtained are split into an 80-20-20 split for training, validation, and testing.
Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU models, CPU types, or memory specifications.
Software Dependencies | No | The paper mentions using SciPy's `solve_ivp` and the Adam optimizer with cosine annealing, but does not specify version numbers for these or other software components.
Experiment Setup | Yes | Hyperparameters: We consider depths in {3, 4, 5}, widths in {128, 256, 512}, and learning rates in {0.0002, 0.0001, 0.001}. ... We train the models using mini-batch Adam and a cosine annealing learning rate schedule. For the synthetic CS data, we used a batch size of 500 and trained the model for 1000 epochs. For the fish milling data, we used a batch size of 1 and trained the model for 10 epochs. ... We fix the transformer to have a hidden dimension of 512 and 5 layers. We train the model for 250 epochs, using a learning rate of 0.0002 and a batch size of 1000, with Adam and a cosine annealing schedule.
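The Multi-Headed Self-Attention definition the report points to (Appendix D of the paper) follows the standard scaled dot-product form. A minimal single-head sketch in NumPy, where the weight names, shapes, and random inputs are illustrative and not taken from the paper:

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for one head.

    X: (n, d) sequence of n tokens (here, the states of n particles).
    Wq, Wk, Wv: (d, d_k) query/key/value projection matrices.
    Returns the attended values (n, d_k) and the attention matrix (n, n).
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # each row is a distribution over tokens
    return A @ V, A

# Illustrative usage with random data.
rng = np.random.default_rng(0)
n, d, dk = 5, 8, 4
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, dk)) for _ in range(3))
out, A = self_attention(X, Wq, Wk, Wv)
```

A multi-headed version would run several such heads in parallel and concatenate their outputs before a final linear projection.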
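The report notes that trajectories are generated with SciPy's `solve_ivp`. As a sketch of how that call can simulate the Cucker-Smale model behind the synthetic dataset, assuming the standard communication kernel psi(r) = K / (sigma^2 + r^2)^beta with illustrative parameter values (not the paper's):

```python
import numpy as np
from scipy.integrate import solve_ivp

def cucker_smale_rhs(t, y, N, K=1.0, sigma=1.0, beta=0.5):
    """Cucker-Smale dynamics in 2D:
        dx_i/dt = v_i
        dv_i/dt = (1/N) * sum_j psi(|x_j - x_i|) * (v_j - v_i)
    The state y stacks all positions, then all velocities."""
    x = y[:2 * N].reshape(N, 2)
    v = y[2 * N:].reshape(N, 2)
    diff_x = x[None, :, :] - x[:, None, :]        # (N, N, 2): x_j - x_i
    diff_v = v[None, :, :] - v[:, None, :]        # (N, N, 2): v_j - v_i
    r2 = (diff_x ** 2).sum(axis=-1)               # (N, N) squared distances
    psi = K / (sigma ** 2 + r2) ** beta           # communication weights
    dv = (psi[:, :, None] * diff_v).mean(axis=1)  # velocity alignment term
    return np.concatenate([v.ravel(), dv.ravel()])

rng = np.random.default_rng(0)
N = 10
y0 = np.concatenate([rng.standard_normal(2 * N),   # initial positions
                     rng.standard_normal(2 * N)])  # initial velocities
sol = solve_ivp(cucker_smale_rhs, (0.0, 10.0), y0,
                args=(N,), t_eval=np.linspace(0.0, 10.0, 50))
```

Because the dynamics average velocities, the spread of the velocities shrinks over time, which is the flocking behavior the paper trains transformers to reproduce.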
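The cosine annealing schedule quoted in the experiment setup decays the learning rate as eta_t = eta_min + (eta_max - eta_min) * (1 + cos(pi * t / T)) / 2. A minimal sketch, where the 250-epoch horizon and initial rate 0.0002 mirror the quoted setup and eta_min = 0 is an assumption:

```python
import math

def cosine_annealing_lr(t, T, eta_max, eta_min=0.0):
    """Learning rate at epoch t out of T under cosine annealing (no restarts)."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t / T))

# Mirroring the quoted setup: initial learning rate 0.0002 over 250 epochs.
schedule = [cosine_annealing_lr(t, 250, 2e-4) for t in range(251)]
```

The rate starts at eta_max, decreases monotonically, and reaches eta_min at epoch T; framework implementations such as PyTorch's `CosineAnnealingLR` follow the same curve.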