Universal Approximation of Mean-Field Models via Transformers
Authors: Shiba Biswal, Karthik Elamvazhuthi, Rishi Sonthalia
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | First, we empirically demonstrate that transformers are well-suited for approximating a variety of mean field models, including the Cucker-Smale model for flocking and milling, and the mean-field system for training two-layer neural networks. We validate our numerical experiments via mathematical theory. Specifically, we prove that if a finite-dimensional transformer effectively approximates the finite-dimensional vector field governing the particle system, then the L2 distance between the expected transformer and the infinite-dimensional mean-field vector field can be uniformly bounded by a function of the number of particles observed during training. |
| Researcher Affiliation | Academia | 1Theoretical Division, Los Alamos National Laboratory, Los Alamos, NM, USA 2Boston College, Boston, MA, USA. Correspondence to: Shiba Biswal <EMAIL>, Rishi Sonthalia <EMAIL>. |
| Pseudocode | No | The paper includes definitions for Multi-Headed Self-Attention and Transformer Network in Appendix D, but these are descriptive definitions rather than structured pseudocode or algorithm blocks for the overall methodology presented in the main paper. |
| Open Source Code | Yes | Code can be found at: https://github.com/rsonthal/Mean-Field-Transformers |
| Open Datasets | Yes | Our first goal focuses on learning the vector field F. Towards this, in this experiment, we use two datasets: first, a synthetic dataset generated from the Cucker-Smale model (Cucker & Smale, 2007), and second, real data of fish milling (Katz et al., 2021). ... Katz, Y., Tunstrøm, K., Ioannou, C. C., Huepe, C., and Couzin, I. D. Fish schooling data subset: Oregon State University. https://ir.library.oregonstate.edu/concern/datasets/zk51vq07c, 2021. |
| Dataset Splits | Yes | The data so obtained are split into an 80-20-20 split for training, validation, and testing. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper mentions using SciPy's `solve_ivp` and the Adam optimizer with cosine annealing, but does not specify version numbers for these or other software components. |
| Experiment Setup | Yes | Hyperparameters: We consider depths in {3, 4, 5}, widths in {128, 256, 512}, and learning rates in {0.0002, 0.0001, 0.001}. ... We train the models using mini-batch Adam and a cosine annealing learning rate schedule. For the synthetic CS data, we used a batch size of 500 and trained the model for 1000 epochs. For the fish milling data, we used a batch size of 1 and trained the model for 10 epochs. ... We fix the transformer to have a hidden dimension of 512 and 5 layers. We train the model for 250 epochs with a learning rate of 0.0002 and a batch size of 1000, using Adam with cosine annealing. |
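Since the synthetic dataset is generated by integrating the Cucker-Smale dynamics with SciPy's `solve_ivp`, a minimal sketch of that data-generation step may be useful context. This is not the authors' code (which lives in the linked repository); the function names, the communication kernel parameters `K` and `beta`, and the initial-condition choices below are illustrative assumptions. The standard Cucker-Smale model evolves positions and velocities as dx_i/dt = v_i, dv_i/dt = (1/N) Σ_j φ(|x_j − x_i|)(v_j − v_i) with φ(r) = K / (1 + r²)^β.

```python
import numpy as np
from scipy.integrate import solve_ivp


def cucker_smale_rhs(t, state, n, d, K=1.0, beta=0.5):
    """Right-hand side of the Cucker-Smale ODE for n agents in d dimensions.

    `state` is the flattened concatenation [positions; velocities].
    """
    x = state[: n * d].reshape(n, d)
    v = state[n * d :].reshape(n, d)
    # Pairwise communication weights phi(r) = K / (1 + r^2)^beta.
    diff = x[None, :, :] - x[:, None, :]          # (n, n, d)
    r2 = np.sum(diff ** 2, axis=-1)               # (n, n)
    phi = K / (1.0 + r2) ** beta
    # dv_i/dt = (1/n) sum_j phi_ij (v_j - v_i); the j == i term vanishes.
    dv = (phi[:, :, None] * (v[None, :, :] - v[:, None, :])).mean(axis=1)
    return np.concatenate([v.ravel(), dv.ravel()])


def simulate(n=20, d=2, t_max=10.0, steps=100, seed=0):
    """Integrate one Cucker-Smale trajectory from random initial conditions."""
    rng = np.random.default_rng(seed)
    x0 = rng.normal(size=(n, d))
    v0 = rng.normal(size=(n, d))
    sol = solve_ivp(
        cucker_smale_rhs,
        (0.0, t_max),
        np.concatenate([x0.ravel(), v0.ravel()]),
        args=(n, d),
        t_eval=np.linspace(0.0, t_max, steps),
    )
    return sol  # sol.y has shape (2 * n * d, steps)
```

With β ≤ 1/2 the classical Cucker-Smale result guarantees unconditional flocking, so the spread of the velocities should shrink over the trajectory; sampled states and the corresponding right-hand-side values would then form (input, target) pairs for learning the vector field.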
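The training recipe quoted above (mini-batch Adam with a cosine annealing learning rate schedule) can be sketched in plain NumPy. This is a toy stand-in, not the paper's training code: a small quadratic objective replaces the transformer, and the learning rate ceiling, epoch counts, and noise level are illustrative assumptions chosen so the toy problem converges.

```python
import numpy as np


def cosine_annealing(step, total_steps, lr_max=2e-4, lr_min=0.0):
    """Cosine-annealed learning rate: lr_max at step 0, lr_min at total_steps."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + np.cos(np.pi * step / total_steps))


def adam_step(w, g, m, v, t, lr, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with bias-corrected first and second moments."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v


def train(epochs=250, batches=10, lr_max=0.01, seed=0):
    """Mini-batch Adam with a per-epoch cosine-annealed learning rate."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=3)                 # toy parameters standing in for the model
    target = np.array([1.0, -2.0, 0.5])    # optimum of the toy quadratic loss
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    t = 0
    for epoch in range(epochs):
        lr = cosine_annealing(epoch, epochs, lr_max=lr_max)
        for _ in range(batches):
            t += 1
            # Noisy "mini-batch" gradient of the quadratic loss |w - target|^2.
            g = 2 * (w - target) + 0.01 * rng.normal(size=3)
            w, m, v = adam_step(w, g, m, v, t, lr)
    return w, target
```

The schedule decays the learning rate smoothly to its minimum over training, which is what `torch.optim.lr_scheduler.CosineAnnealingLR` would do in a PyTorch implementation of the quoted setup.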