Linear algebra with transformers
Authors: François Charton
TMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, I investigate the capability of transformers to learn to perform numerical computations with high accuracy. I focus on nine problems of linear algebra, from basic operations on dense matrices to inversion, eigen and singular value decomposition. I show that small transformers can be trained, from examples only, to compute approximate solutions (up to a few percents of the L1 norm) with more than 90% accuracy (over 99% in most cases). I propose and discuss four encodings to represent real numbers, and train small sequence to sequence transformers (up to 6 layers, 10 to 50 million trainable parameters) from generated datasets of random matrices. I investigate different architectures, in particular asymmetric configurations where the encoder or decoder has only one layer. Finally, I show that the models are robust to noisy data, and that they can generalize out of their training distribution if special attention is paid to training data generation. |
| Researcher Affiliation | Industry | François Charton Meta AI EMAIL |
| Pseudocode | No | The paper describes the problems and methods used in prose and through experimental results tables, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The source code for the model and experiments is available at github.com/facebookresearch/LAWT. |
| Open Datasets | No | For each problem, the training data is generated by sampling random input matrices I (see section 2.2), and computing the output O with a linear algebra package (NumPy linalg). All coefficients in I and O are set in base ten floating-point representation, and rounded to three significant digits in the mantissa. |
| Dataset Splits | Yes | At the end of every epoch (300,000 examples), a random test set (10,000 examples) is generated and model accuracy is evaluated. A predicted sequence is a correct solution to the problem (I, O) (I and O the input and output matrices) if it can be decoded as a valid matrix P and approximates the correct solution to a given tolerance τ. |
| Hardware Specification | Yes | All models are trained on an internal cluster, using NVIDIA Volta GPU with 32GB memory. |
| Software Dependencies | No | The paper mentions using a "linear algebra package (NumPy linalg)" and that the models "run in Python" for comparison, but it does not specify version numbers for NumPy or Python or any other key software dependencies. |
| Experiment Setup | Yes | All models use the transformer architecture from Vaswani et al. (2017): an encoder and a decoder connected by cross-attention. Models have 512 dimensions, 8 attention heads and up to 6 layers (experiments with larger models can be found in Appendix D.3). Training is supervised, minimizes the cross-entropy between model predictions and correct solutions, and uses the Adam optimiser (Kingma & Ba, 2014) with a learning rate of 10⁻⁴, a linear warm-up phase of 10,000 steps and cosine scheduling (Loshchilov & Hutter, 2016). Training data is generated on the fly in batches of 64. |
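The quotes above state that matrix coefficients are written in base-ten floating point and rounded to three significant digits in the mantissa. A minimal sketch of such a sign/mantissa/exponent tokenization is shown below; the function name `encode_float` and the exact token strings are illustrative assumptions, not the paper's actual encoding tables (the paper proposes four encodings):

```python
from math import floor, log10

def encode_float(x, digits=3):
    """Tokenize a real number as [sign, mantissa, exponent] tokens in base
    ten, rounded to `digits` significant digits (hypothetical sketch)."""
    if x == 0:
        return ["+", "0", "E0"]
    sign = "+" if x > 0 else "-"
    e = floor(log10(abs(x)))                  # decimal exponent of x
    m = round(abs(x) / 10 ** e, digits - 1)   # mantissa, normally in [1, 10)
    mantissa = round(m * 10 ** (digits - 1))  # integer with `digits` digits
    if mantissa >= 10 ** digits:              # rounding pushed m up to 10.0
        mantissa //= 10
        e += 1
    return [sign, str(mantissa), f"E{e - (digits - 1)}"]
```

For example, `encode_float(3.14159)` yields `["+", "314", "E-2"]`, i.e. 314 × 10⁻², matching the three-significant-digit rounding described in the paper.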
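The evaluation criterion quoted above counts a prediction as correct if it decodes to a valid matrix that approximates the solution to a given tolerance τ. A sketch of such a check is below, treating matrices as flat coefficient lists; the relative-L1-error formula is an assumption based on the quoted description, not a verbatim reproduction of the paper's metric:

```python
def is_correct(pred, target, tau=0.05):
    """Return True if pred approximates target within relative L1 error tau.

    pred/target: flat lists of matrix coefficients. A prediction that does
    not decode to a matrix of the right shape is counted as incorrect.
    """
    if len(pred) != len(target):
        return False  # not a valid matrix of the expected shape
    err = sum(abs(p - t) for p, t in zip(pred, target))
    norm = sum(abs(t) for t in target)
    return err <= tau * norm
```

With τ = 5%, a prediction whose summed coefficient error is within 5% of the target's L1 norm would be accepted.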
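The training setup quoted above combines a base learning rate of 10⁻⁴, a 10,000-step linear warm-up, and cosine scheduling (Loshchilov & Hutter, 2016). A self-contained sketch of that schedule follows; `total_steps` is an assumption for illustration, since the paper does not fix a single training length here:

```python
import math

def learning_rate(step, base_lr=1e-4, warmup_steps=10_000, total_steps=300_000):
    """Linear warm-up to base_lr over warmup_steps, then cosine decay to 0.

    base_lr and warmup_steps come from the setup quoted above; total_steps
    is an assumed value for this sketch.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps          # linear ramp-up
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay
```

The rate rises linearly to 10⁻⁴ at step 10,000, then follows a half-cosine down to zero at `total_steps`.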