Towards Learning High-Precision Least Squares Algorithms with Sequence Models

Authors: Jerry Liu, Jessica Grogan, Owen Dugan, Ashish Rao, Simran Arora, Atri Rudra, Christopher Ré

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We find that prior approaches using Transformers fail to meet these criteria, and identify limitations present in existing architectures and training procedures. First, we show that softmax Transformers struggle to perform high-precision multiplications, which prevents them from precisely learning numerical algorithms. Second, we identify an alternate class of architectures, comprised entirely of polynomials, that can efficiently represent high-precision gradient descent iterates. Finally, we investigate precision bottlenecks during training and address them via a high-precision training recipe that reduces stochastic gradient noise. Our recipe enables us to train two polynomial architectures, gated convolutions and linear attention, to perform gradient descent iterates on least squares problems. For the first time, we demonstrate the ability to train to near machine precision. Applied iteratively, our models obtain 100,000× lower MSE than standard Transformers trained end-to-end, and they incur a 10,000× smaller generalization gap on out-of-distribution problems. We make progress towards end-to-end learning of numerical algorithms for least squares.
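For reference, the gradient descent iterate on least squares that the paper's models are trained to perform has a simple closed form. The sketch below is illustrative only (function names and the learning rate are my own, not from the paper's code); it shows the update x ← x − η·Aᵀ(Ax − b) converging to the least squares solution:

```python
import numpy as np

def gd_iterate(A, b, x, lr):
    """One gradient descent step on the objective 0.5 * ||Ax - b||^2."""
    grad = A.T @ (A @ x - b)
    return x - lr * grad

# Applied iteratively on a well-conditioned problem, the iterates
# approach the least squares solution.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))   # matches the paper's 20 x 5 setup
b = rng.standard_normal(20)
x = np.zeros(5)
for _ in range(500):
    x = gd_iterate(A, b, x, lr=0.01)

x_star, *_ = np.linalg.lstsq(A, b, rcond=None)  # reference solution
```

The step size must satisfy lr < 2 / λ_max(AᵀA) for convergence; 0.01 is a safe choice for this problem size.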
Researcher Affiliation Academia Jerry Liu, Institute of Computational & Mathematical Engineering, Stanford University, Stanford, CA, USA; Jessica Grogan, Atri Rudra, Department of Computer Science & Engineering, University at Buffalo, Buffalo, NY, USA; Owen Dugan, Ashish Rao, Simran Arora, Chris Ré, Department of Computer Science, Stanford University, Stanford, CA, USA
Pseudocode Yes Algorithm 1 BASECONV(u, W, B1, H, B2); Algorithm 2 Circuit C_P(x); Algorithm 3 Circuit for P(u)
Open Source Code Yes We provide all the code and configuration files necessary to reproduce our experiments at https://github.com/HazyResearch/precision-ls.
Open Datasets No In this work, all experiments are done using synthetic data and tasks.
Dataset Splits No At each training step, we produce a random training prompt u_in by sampling each variable randomly: from the isotropic Gaussian distribution N(0, I) for continuous-valued parameters, and from the uniform distribution for discrete parameters.
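Because prompts are sampled fresh at every step, there is no fixed dataset to split. A minimal sketch of this kind of on-the-fly sampling (function and variable names here are hypothetical, not taken from the released code):

```python
import numpy as np

def sample_prompt(n_points=20, dim=5, rng=None):
    """Sample a fresh least squares problem per training step.

    Continuous-valued parameters are drawn from the isotropic
    Gaussian N(0, I); there is no fixed train/val split because
    each step sees newly generated data.
    """
    rng = rng or np.random.default_rng()
    A = rng.standard_normal((n_points, dim))  # input points ~ N(0, I)
    w = rng.standard_normal(dim)              # ground-truth weights
    b = A @ w                                 # noiseless targets
    return A, b, w
```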
Hardware Specification Yes All experiments were conducted using PyTorch on NVIDIA A100/H100 GPUs.
Software Dependencies No All experiments were conducted using PyTorch on NVIDIA A100/H100 GPUs.
Experiment Setup Yes We base our Transformer and BASECONV models off the GPT2 family (Radford et al., 2019). Unless otherwise specified, we use the following default settings for Transformers: Embedding size 64; Number of layers 12; Number of heads 8; MLPs True; MLP hidden size 4 × embedding size; MLP activation ReLU; Layer Norms True; Input dim 5; Sequence length 20. We describe two sets of optimizer settings we use throughout this work: Batch size 256; Optimizer Adam; Learning rate 1e-3; Scheduler StepLR; Training iterations 10^6; Step rate 10^4; Decay rate 0.9.
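The optimizer settings above map directly onto standard PyTorch constructs. A hedged sketch, assuming a StepLR schedule stepped once per training iteration; the model and loss are placeholders, not the paper's actual architecture:

```python
import torch

# Placeholder model standing in for the Transformer/BaseConv
# (embedding size 64, per the reported config).
model = torch.nn.Linear(64, 64)

# Reported settings: Adam, lr 1e-3, StepLR decaying by 0.9 every 1e4 steps.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(
    optimizer, step_size=10_000, gamma=0.9
)

for step in range(3):  # the paper trains for 1e6 iterations
    x = torch.randn(256, 64)          # batch size 256
    loss = model(x).pow(2).mean()     # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```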