Towards Learning High-Precision Least Squares Algorithms with Sequence Models
Authors: Jerry Liu, Jessica Grogan, Owen Dugan, Ashish Rao, Simran Arora, Atri Rudra, Christopher Ré
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We find that prior approaches using Transformers fail to meet these criteria, and identify limitations present in existing architectures and training procedures. First, we show that softmax Transformers struggle to perform high-precision multiplications, which prevents them from precisely learning numerical algorithms. Second, we identify an alternate class of architectures, comprised entirely of polynomials, that can efficiently represent high-precision gradient descent iterates. Finally, we investigate precision bottlenecks during training and address them via a high-precision training recipe that reduces stochastic gradient noise. Our recipe enables us to train two polynomial architectures, gated convolutions and linear attention, to perform gradient descent iterates on least squares problems. For the first time, we demonstrate the ability to train to near machine precision. Applied iteratively, our models obtain 100,000× lower MSE than standard Transformers trained end-to-end, and they incur a 10,000× smaller generalization gap on out-of-distribution problems. We make progress towards end-to-end learning of numerical algorithms for least squares. |
| Researcher Affiliation | Academia | Jerry Liu, Institute of Computational & Mathematical Engineering, Stanford University, Stanford, CA, USA; Jessica Grogan, Atri Rudra, Department of Computer Science & Engineering, University at Buffalo, Buffalo, NY, USA; Owen Dugan, Ashish Rao, Simran Arora, Chris Ré, Department of Computer Science, Stanford University, Stanford, CA, USA |
| Pseudocode | Yes | Algorithm 1 BASECONV(u, W , B1, H, B2) Algorithm 2 circuit CP (x): Algorithm 3 Circuit for P(u): |
| Open Source Code | Yes | We provide all the code and configuration files necessary to reproduce our experiments at https://github.com/HazyResearch/precision-ls. |
| Open Datasets | No | In this work, all experiments are done using synthetic data and tasks. |
| Dataset Splits | No | At each training step, we produce a random training prompt u_in by sampling each variable randomly: from the isotropic Gaussian distribution N(0, I) for continuous-valued parameters, and from the uniform distribution for discrete parameters. |
| Hardware Specification | Yes | All experiments were conducted using PyTorch on NVIDIA A100/H100 GPUs. |
| Software Dependencies | No | All experiments were conducted using PyTorch on NVIDIA A100/H100 GPUs. |
| Experiment Setup | Yes | We base our Transformer and BASECONV models off the GPT-2 family (Radford et al., 2019). Unless otherwise specified, we use the following default settings for Transformers: embedding size 64; number of layers 12; number of heads 8; MLPs true; MLP hidden size 4 × embedding size; MLP activation ReLU; layer norms true; input dim 5; sequence length 20. We describe two sets of optimizer settings we use throughout this work: batch size 256; optimizer Adam; learning rate 10^-3; scheduler StepLR; training iterations 10^6; step rate 10^4; decay rate 0.9. |
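The target computation the paper's models learn, gradient descent iterates on least squares, can be sketched numerically. This is a minimal illustration (not the paper's code), using the input dim 5 and sequence length 20 from the setup above; the step-size choice and iteration count here are assumptions for the sketch:

```python
import numpy as np

# Least squares: find w minimizing ||A w - b||^2 for A (n x d), b (n,).
# Gradient descent iterate: w_{k+1} = w_k - eta * A^T (A w_k - b).
rng = np.random.default_rng(0)
n, d = 20, 5                              # sequence length 20, input dim 5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

w = np.zeros(d)
eta = 1.0 / np.linalg.norm(A, 2) ** 2     # step size 1/||A||_2^2 (assumed here)
for _ in range(1000):
    w = w - eta * A.T @ (A @ w - b)       # one gradient descent iterate

# Compare against the closed-form least-squares solution.
w_star = np.linalg.lstsq(A, b, rcond=None)[0]
err = float(np.max(np.abs(w - w_star)))   # max-abs error vs. closed form
```

Applied for enough iterations, the iterates approach the closed-form solution to near machine precision in float64, which is the behavior the paper trains polynomial architectures to reproduce.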
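The optimizer settings in the setup row map naturally onto PyTorch objects. A hedged sketch of that mapping, where we assume "step rate 10^4" is StepLR's `step_size` and "decay rate 0.9" is its `gamma`, and a plain `nn.Linear` stands in for the GPT-2-style model:

```python
import torch

# Stand-in model: embedding size 64 per the setup table; the actual models
# are GPT-2-family Transformers and BASECONV, not a single linear layer.
model = torch.nn.Linear(64, 64)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10_000, gamma=0.9)

for step in range(3):                     # 10^6 iterations in the paper; 3 here
    optimizer.zero_grad()
    batch = torch.randn(256, 64)          # batch size 256
    loss = model(batch).pow(2).mean()     # placeholder loss for the sketch
    loss.backward()
    optimizer.step()
    scheduler.step()                      # lr decays by 0.9 every 10^4 steps
```

With `step_size=10_000`, the learning rate stays at 10^-3 for the first 10^4 steps and is multiplied by 0.9 at each subsequent boundary.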