A Solvable Attention for Neural Scaling Laws

Authors: Bochen Lyu, Di Wang, Zhanxing Zhu

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type: Experimental. This paper studies this intriguing phenomenon, particularly for the transformer architecture, in theoretical setups. Specifically, we propose a framework for linear self-attention, the underpinning block of the transformer without softmax, to learn in an in-context manner, where the corresponding learning dynamics are modeled as a non-linear ordinary differential equation (ODE) system. Furthermore, we establish a procedure to derive a tractable approximate solution for this ODE system by reformulating it as a Riccati equation, which allows us to precisely characterize neural scaling laws for linear self-attention with respect to training time, model size, data size, and the optimal compute. In addition, we reveal that linear self-attention shares similar neural scaling laws with several other architectures when the context sequence length of the in-context learning is fixed; otherwise, it exhibits a different scaling law with training time.
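The softmax-free attention layer at the center of this analysis can be sketched in a few lines. This is a minimal illustration, not the paper's code: the prompt layout (feature rows plus one label row, with the query's label zeroed) and the merged weight matrices `W_KQ`, `W_PV` are our assumptions about the standard linear self-attention formulation for in-context regression.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 3, 32  # embedding dim (features + one label row), context length

# In-context regression prompt: each column is an (x_i, y_i) pair; the last
# column holds the query x with its label masked to zero.
X = rng.normal(size=(d - 1, N + 1))
w_star = rng.normal(size=d - 1)        # hidden task vector (assumed)
Z = np.vstack([X, X.T @ w_star])       # stack labels as the last row
Z[-1, -1] = 0.0                        # hide the query's label

W_KQ = rng.normal(size=(d, d)) / d     # merged key-query weights (assumed)
W_PV = rng.normal(size=(d, d)) / d     # merged projection-value weights

def linear_self_attention(Z, W_KQ, W_PV, N):
    """One softmax-free attention layer: Z + W_PV Z (Z^T W_KQ Z) / N."""
    return Z + (W_PV @ Z) @ (Z.T @ W_KQ @ Z) / N

out = linear_self_attention(Z, W_KQ, W_PV, N)
pred = out[-1, -1]   # the prediction for the query sits in the label slot
```

The residual connection and the 1/N normalization over context tokens follow the usual convention for this architecture; the paper's exact parameterization may differ.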
Researcher Affiliation: Collaboration. Bochen Lyu (University of Southampton; Data Canvas), Di Wang (Independent Researcher), Zhanxing Zhu (University of Southampton).
Pseudocode: No. The paper describes mathematical derivations and solution procedures for ODE systems but does not present any structured pseudocode or algorithm blocks. For example, Section 3.2 provides a "Procedure sketch", which is descriptive text.
Open Source Code: No. The paper contains no explicit statement about releasing source code for the described methodology and provides no links to code repositories.
Open Datasets: No. The paper designs a "multitask sparse feature regression (MSFR) problem" and discusses "generation of in-context data", indicating the use of synthetically generated data based on the authors' framework rather than a pre-existing, publicly available dataset with concrete access information.
Dataset Splits: No. The paper discusses generating "an in-context dataset with N data points" and uses terms like "training data points", but it does not specify explicit training, validation, or test splits (e.g., percentages, sample counts, or references to predefined splits) in a way that would allow the data partitioning to be reproduced.
Hardware Specification: No. The paper does not provide any specific details about the hardware used for the numerical experiments, such as GPU or CPU models, memory specifications, or cloud computing instances.
Software Dependencies: No. The paper does not list software dependencies with version numbers. It mentions optimization algorithms such as "gradient descent in the continuous time limit" and AdamW, but provides no versioned software for these or any other libraries.
Experiment Setup: Yes. For the discrete GD training, the learning rate is set to 10^-3 and the total number of optimization steps to 5000. The theoretical prediction using the solution f_s^0(t) is simulated with the forward Euler method, such that t = k*eta, where k is the optimization step and eta is the learning rate. Table 4 (parameters for AdamW): learning rate eta = 5*10^-3, beta_1 = 0.9, beta_2 = 0.999, weight decay = 10^-5.
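The t = k*eta identification quoted above can be illustrated with a toy example. Only the learning rate (10^-3) and step count (5000) are taken from the reported setup; the quadratic loss below is our own stand-in for the paper's Riccati system, and the curvature and initialization are arbitrary.

```python
import numpy as np

# Discrete GD on a toy quadratic loss L(w) = a * w**2 / 2 (our stand-in,
# not the paper's Riccati system) versus the closed-form gradient-flow
# solution w(t) = w0 * exp(-a * t) sampled at t = k * eta.
eta, steps = 1e-3, 5000   # values reported in the paper's GD setup
a, w0 = 2.0, 1.0          # toy curvature and initialization (assumed)

w, gap = w0, 0.0
for k in range(1, steps + 1):
    w -= eta * a * w                      # one gradient-descent step
    w_flow = w0 * np.exp(-a * k * eta)    # continuous-time prediction
    gap = max(gap, abs(w - w_flow))

# gap stays O(eta): the discrete trajectory tracks the continuous-time
# solution under the t = k * eta identification.
```

This is the same recipe as in the quoted setup, just applied to a loss whose gradient flow has an elementary closed form instead of a Riccati solution.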