Linear algebra with transformers

Authors: François Charton

TMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, I investigate the capability of transformers to learn to perform numerical computations with high accuracy. I focus on nine problems of linear algebra, from basic operations on dense matrices to inversion, eigen and singular value decomposition. I show that small transformers can be trained, from examples only, to compute approximate solutions (up to a few percents of the L1 norm) with more than 90% accuracy (over 99% in most cases). I propose and discuss four encodings to represent real numbers, and train small sequence to sequence transformers (up to 6 layers, 10 to 50 million trainable parameters) from generated datasets of random matrices. I investigate different architectures, in particular asymmetric configurations where the encoder or decoder has only one layer. Finally, I show that the models are robust to noisy data, and that they can generalize out of their training distribution if special attention is paid to training data generation.
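The abstract mentions four encodings for representing real numbers as token sequences. As an illustration, here is a minimal sketch of one plausible base-ten scheme of the kind the paper describes (sign token, three mantissa digits, exponent token); the function name and token vocabulary are assumptions, not the paper's exact definitions.

```python
from math import floor, log10

def encode_p10(x):
    """Encode a float as [sign, d1, d2, d3, exponent] tokens,
    keeping three significant digits in the mantissa."""
    if x == 0:
        return ["+", "0", "0", "0", "E0"]
    sign = "+" if x > 0 else "-"
    e = floor(log10(abs(x)))
    mantissa = round(abs(x) / 10 ** e * 100)  # three significant digits
    if mantissa >= 1000:  # rounding carried into the next power of ten
        mantissa //= 10
        e += 1
    return [sign] + list(str(mantissa)) + [f"E{e - 2}"]
```

For example, 3.14 becomes `['+', '3', '1', '4', 'E-2']`, i.e. +314 x 10^-2, so every coefficient of a matrix occupies a fixed number of tokens.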
Researcher Affiliation | Industry | François Charton, Meta AI, EMAIL
Pseudocode | No | The paper describes the problems and methods used in prose and through experimental results tables, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The source code for the model and experiments is available at github.com/facebookresearch/LAWT.
Open Datasets | No | For each problem, the training data is generated by sampling random input matrices I (see section 2.2), and computing the output O with a linear algebra package (NumPy linalg). All coefficients in I and O are set in base ten floating-point representation, and rounded to three significant digits in the mantissa.
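This generation recipe can be sketched in a few lines of NumPy. The example below uses matrix inversion (one of the paper's nine problems); the coefficient range and matrix size are illustrative assumptions, not values quoted from the paper.

```python
import numpy as np

def round_sig(x, sig=3):
    """Round each coefficient to `sig` significant digits in the mantissa."""
    exp = np.floor(np.log10(np.maximum(np.abs(x), 1e-300)))
    factor = 10.0 ** (exp - sig + 1)
    return np.round(x / factor) * factor

def make_example(n=5, seed=0):
    """Sample a random input matrix I and compute the output O = I^-1
    with numpy.linalg, rounding both to three significant digits.
    (Coefficient range [-10, 10] is an assumption for illustration.)"""
    rng = np.random.default_rng(seed)
    I = round_sig(rng.uniform(-10, 10, (n, n)))
    O = round_sig(np.linalg.inv(I))
    return I, O
```

Because examples are generated on the fly from a random sampler, there is no fixed dataset to release, which is consistent with the "No" verdict above.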
Dataset Splits | Yes | At the end of every epoch (300,000 examples), a random test set (10,000 examples) is generated and model accuracy is evaluated. A predicted sequence is a correct solution to the problem (I, O) (I and O the input and output matrices) if it can be decoded as a valid matrix P and approximates the correct solution to a given tolerance τ.
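The tolerance criterion can be sketched as a relative-L1 check between the decoded prediction P and the true output O; the exact norm and the default τ = 5% below are assumptions guided by the abstract's "a few percents of the L1 norm", not the paper's precise definition.

```python
import numpy as np

def is_correct(pred, target, tol=0.05):
    """Count a decoded matrix P as correct if it is within tolerance
    `tol` of the true solution O in relative L1 norm (assumed metric)."""
    return np.sum(np.abs(pred - target)) / np.sum(np.abs(target)) < tol
```

A prediction off by a fraction of a percent passes, while a prediction with a wrong coefficient of comparable magnitude to the matrix entries fails.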
Hardware Specification | Yes | All models are trained on an internal cluster, using NVIDIA Volta GPU with 32GB memory.
Software Dependencies | No | The paper mentions using a "linear algebra package (NumPy linalg)" and that the models "run in Python" for comparison, but it does not specify version numbers for NumPy or Python or any other key software dependencies.
Experiment Setup | Yes | All models use the transformer architecture from Vaswani et al. (2017): an encoder and a decoder connected by cross-attention. Models have 512 dimensions, 8 attention heads and up to 6 layers (experiments with larger models can be found in Appendix D.3). Training is supervised, minimizes the cross-entropy between model predictions and correct solutions, and uses the Adam optimiser (Kingma & Ba, 2014) with a learning rate of 10^-4, a linear warm-up phase of 10,000 steps and cosine scheduling (Loshchilov & Hutter, 2016). Training data is generated on the fly in batches of 64.
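The learning-rate schedule described above (linear warm-up then cosine decay) can be sketched as a small step-dependent function; the total step count over which the cosine anneals is an assumption here, while the base rate and warm-up length follow the setup quoted above.

```python
import math

def learning_rate(step, base_lr=1e-4, warmup=10_000, total=300_000):
    """Linear warm-up to base_lr over `warmup` steps, then cosine decay
    to zero by `total` steps (the `total` horizon is an assumption)."""
    if step < warmup:
        return base_lr * step / warmup
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

The rate rises to 10^-4 at step 10,000, then follows half a cosine period back toward zero, the shape described by Loshchilov & Hutter (2016).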