Towards Learning High-Precision Least Squares Algorithms with Sequence Models

Authors: Jerry Liu, Jessica Grogan, Owen Dugan, Ashish Rao, Simran Arora, Atri Rudra, Christopher Ré

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We find that prior approaches using Transformers fail to meet these criteria, and identify limitations present in existing architectures and training procedures. First, we show that softmax Transformers struggle to perform high-precision multiplications, which prevents them from precisely learning numerical algorithms. Second, we identify an alternate class of architectures, comprised entirely of polynomials, that can efficiently represent high-precision gradient descent iterates. Finally, we investigate precision bottlenecks during training and address them via a high-precision training recipe that reduces stochastic gradient noise. Our recipe enables us to train two polynomial architectures, gated convolutions and linear attention, to perform gradient descent iterates on least squares problems. For the first time, we demonstrate the ability to train to near machine precision. Applied iteratively, our models obtain 100,000× lower MSE than standard Transformers trained end-to-end, and they incur a 10,000× smaller generalization gap on out-of-distribution problems. We make progress towards end-to-end learning of numerical algorithms for least squares.
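For reference, the gradient descent iterate on least squares that the paper's models are trained to perform has a simple closed form. The sketch below is illustrative only (function names and the learning rate are my own, not from the paper's code); it shows the update x ← x − η·Aᵀ(Ax − b) converging to the least squares solution:

```python
import numpy as np

def gd_iterate(A, b, x, lr):
    """One gradient descent step on the objective 0.5 * ||Ax - b||^2."""
    grad = A.T @ (A @ x - b)
    return x - lr * grad

# Applied iteratively on a well-conditioned problem, the iterates
# approach the least squares solution.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))   # matches the paper's 20 x 5 setup
b = rng.standard_normal(20)
x = np.zeros(5)
for _ in range(500):
    x = gd_iterate(A, b, x, lr=0.01)

x_star, *_ = np.linalg.lstsq(A, b, rcond=None)  # reference solution
```

The step size must satisfy lr < 2 / λ_max(AᵀA) for convergence; 0.01 is a safe choice for this problem size.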
Researcher Affiliation Academia Jerry Liu, Institute of Computational & Mathematical Engineering, Stanford University, Stanford, CA, USA; Jessica Grogan, Atri Rudra, Department of Computer Science & Engineering, University at Buffalo, Buffalo, NY, USA; Owen Dugan, Ashish Rao, Simran Arora, Chris Ré, Department of Computer Science, Stanford University, Stanford, CA, USA
Pseudocode Yes Algorithm 1 BASECONV(u, W, B1, H, B2); Algorithm 2 Circuit C_P(x); Algorithm 3 Circuit for P(u)
Open Source Code Yes We provide all the code and configuration files necessary to reproduce our experiments at https://github.com/HazyResearch/precision-ls.
Open Datasets No In this work, all experiments are done using synthetic data and tasks.
Dataset Splits No At each training step, we produce a random training prompt u_in by sampling each variable randomly: from the isotropic Gaussian distribution N(0, I) for continuous-valued parameters, and from the uniform distribution for discrete parameters.
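Because prompts are sampled fresh at every step, there is no fixed dataset to split. A minimal sketch of this kind of on-the-fly sampling (function and variable names here are hypothetical, not taken from the released code):

```python
import numpy as np

def sample_prompt(n_points=20, dim=5, rng=None):
    """Sample a fresh least squares problem per training step.

    Continuous-valued parameters are drawn from the isotropic
    Gaussian N(0, I); there is no fixed train/val split because
    each step sees newly generated data.
    """
    rng = rng or np.random.default_rng()
    A = rng.standard_normal((n_points, dim))  # input points ~ N(0, I)
    w = rng.standard_normal(dim)              # ground-truth weights
    b = A @ w                                 # noiseless targets
    return A, b, w
```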
Hardware Specification Yes All experiments were conducted using PyTorch on NVIDIA A100/H100 GPUs.
Software Dependencies No All experiments were conducted using PyTorch on NVIDIA A100/H100 GPUs.
Experiment Setup Yes We base our Transformer and BASECONV models off the GPT2 family (Radford et al., 2019). Unless otherwise specified, we use the following default settings for Transformers: Embedding size 64; Number of layers 12; Number of heads 8; MLPs True; MLP hidden size 4 × embedding size; MLP activation ReLU; Layer Norms True; Input dim 5; Sequence length 20. We describe two sets of optimizer settings we use throughout this work: Batch size 256; Optimizer Adam; Learning rate 1e-3; Scheduler StepLR; Training iterations 10^6; Step rate 10^4; Decay rate 0.9.
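The optimizer settings above map directly onto standard PyTorch constructs. A hedged sketch, assuming a StepLR schedule stepped once per training iteration; the model and loss are placeholders, not the paper's actual architecture:

```python
import torch

# Placeholder model standing in for the Transformer/BaseConv
# (embedding size 64, per the reported config).
model = torch.nn.Linear(64, 64)

# Reported settings: Adam, lr 1e-3, StepLR decaying by 0.9 every 1e4 steps.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(
    optimizer, step_size=10_000, gamma=0.9
)

for step in range(3):  # the paper trains for 1e6 iterations
    x = torch.randn(256, 64)          # batch size 256
    loss = model(x).pow(2).mean()     # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```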