A Solvable Attention for Neural Scaling Laws

Authors: Bochen Lyu, Di Wang, Zhanxing Zhu

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type: Experimental. This paper studies this intriguing phenomenon, particularly for the transformer architecture, in theoretical setups. Specifically, we propose a framework for linear self-attention, the underpinning block of the transformer without softmax, to learn in an in-context manner, where the corresponding learning dynamics are modeled as a non-linear ordinary differential equation (ODE) system. Furthermore, we establish a procedure to derive a tractable approximate solution for this ODE system by reformulating it as a Riccati equation, which allows us to precisely characterize neural scaling laws for linear self-attention with respect to training time, model size, data size, and the optimal compute. In addition, we reveal that linear self-attention shares similar neural scaling laws with several other architectures when the context sequence length of the in-context learning is fixed; otherwise, it exhibits a different scaling law with training time.
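The softmax-free attention layer at the center of this analysis can be sketched in a few lines. This is a minimal illustration, not the paper's code: the prompt layout (feature rows plus one label row, with the query's label zeroed) and the merged weight matrices `W_KQ`, `W_PV` are our assumptions about the standard linear self-attention formulation for in-context regression.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 3, 32  # embedding dim (features + one label row), context length

# In-context regression prompt: each column is an (x_i, y_i) pair; the last
# column holds the query x with its label masked to zero.
X = rng.normal(size=(d - 1, N + 1))
w_star = rng.normal(size=d - 1)        # hidden task vector (assumed)
Z = np.vstack([X, X.T @ w_star])       # stack labels as the last row
Z[-1, -1] = 0.0                        # hide the query's label

W_KQ = rng.normal(size=(d, d)) / d     # merged key-query weights (assumed)
W_PV = rng.normal(size=(d, d)) / d     # merged projection-value weights

def linear_self_attention(Z, W_KQ, W_PV, N):
    """One softmax-free attention layer: Z + W_PV Z (Z^T W_KQ Z) / N."""
    return Z + (W_PV @ Z) @ (Z.T @ W_KQ @ Z) / N

out = linear_self_attention(Z, W_KQ, W_PV, N)
pred = out[-1, -1]   # the prediction for the query sits in the label slot
```

The residual connection and the 1/N normalization over context tokens follow the usual convention for this architecture; the paper's exact parameterization may differ.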
Researcher Affiliation: Collaboration. Bochen Lyu (University of Southampton; Data Canvas), Di Wang (Independent Researcher), Zhanxing Zhu (University of Southampton).
Pseudocode: No. The paper describes mathematical derivations and solution procedures for ODE systems but does not present any structured pseudocode or algorithm blocks. For example, Section 3.2 provides a "Procedure sketch", which is descriptive text.
Open Source Code: No. The paper contains no explicit statement about releasing source code for the described methodology and provides no links to code repositories.
Open Datasets: No. The paper designs a "multitask sparse feature regression (MSFR) problem" and discusses "generation of in-context data", indicating the use of synthetically generated data based on the authors' framework rather than a pre-existing, publicly available dataset with concrete access information.
Dataset Splits: No. The paper discusses generating "an in-context dataset with N data points" and uses terms like "training data points", but it does not specify explicit training, validation, or test splits (e.g., percentages, sample counts, or references to predefined splits) in a way that would allow the data partitioning to be reproduced.
Hardware Specification: No. The paper does not provide any specific details about the hardware used for the numerical experiments, such as GPU or CPU models, memory specifications, or cloud computing instances.
Software Dependencies: No. The paper does not list software dependencies with version numbers. It mentions optimization algorithms such as "gradient descent in the continuous time limit" and AdamW, but provides no versioned software for these or any other libraries.
Experiment Setup: Yes. For the discrete GD training, the learning rate is set to 10^-3 and the total number of optimization steps to 5000. The theoretical prediction using the solution f_s^0(t) is simulated with the forward Euler method, such that t = k*eta, where k is the optimization step and eta is the learning rate. Table 4 (parameters for AdamW): learning rate eta = 5*10^-3, beta_1 = 0.9, beta_2 = 0.999, weight decay = 10^-5.
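The t = k*eta identification quoted above can be illustrated with a toy example. Only the learning rate (10^-3) and step count (5000) are taken from the reported setup; the quadratic loss below is our own stand-in for the paper's Riccati system, and the curvature and initialization are arbitrary.

```python
import numpy as np

# Discrete GD on a toy quadratic loss L(w) = a * w**2 / 2 (our stand-in,
# not the paper's Riccati system) versus the closed-form gradient-flow
# solution w(t) = w0 * exp(-a * t) sampled at t = k * eta.
eta, steps = 1e-3, 5000   # values reported in the paper's GD setup
a, w0 = 2.0, 1.0          # toy curvature and initialization (assumed)

w, gap = w0, 0.0
for k in range(1, steps + 1):
    w -= eta * a * w                      # one gradient-descent step
    w_flow = w0 * np.exp(-a * k * eta)    # continuous-time prediction
    gap = max(gap, abs(w - w_flow))

# gap stays O(eta): the discrete trajectory tracks the continuous-time
# solution under the t = k * eta identification.
```

This is the same recipe as in the quoted setup, just applied to a loss whose gradient flow has an elementary closed form instead of a Riccati solution.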