Customizing the Inductive Biases of Softmax Attention using Structured Matrices

Authors: Yilun Kuang, Noah Amsel, Sanae Lotfi, Shikai Qiu, Andres Potapczynski, Andrew Gordon Wilson

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental On in-context regression tasks with high-dimensional inputs, our proposed scoring functions outperform standard attention for any fixed compute budget. On language modeling, a task that exhibits locality patterns, our MLR-based attention method achieves improved scaling laws compared to both standard attention and variants of sliding window attention. Additionally, we show that both BTT and MLR fall under a broader family of efficient structured matrices capable of encoding either full-rank or distance-dependent compute biases, thereby addressing significant shortcomings of standard attention. ... In Figure 2, we demonstrate this unfavorable trade-off for in-context regression, replicating the setting from Garg et al. (2022). ... In Section 4, we train transformers with these structured scoring functions on in-context regression. ... In Sections 3.4 and 5, we use MLR matrices to introduce a distance-dependent compute bias, which slightly outperforms previous methods in language modeling and time series forecasting.
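The row above summarizes the paper's core move: replacing the standard dot-product attention score q·k with a bilinear form qᵀMk, where M is an efficient structured matrix (the paper's BTT and MLR families). As a rough illustration only, the sketch below substitutes a generic low-rank-plus-diagonal structure for M; this stand-in is my own and is not the paper's actual BTT or MLR construction.

```python
import numpy as np

def standard_score(q, k):
    """Standard softmax-attention logits: pairwise dot products."""
    return q @ k.T

def structured_score(q, k, U, V, d):
    """Bilinear logits q^T M k with M = U V^T + diag(d).

    U, V, d form a generic low-rank-plus-diagonal stand-in for the
    structured families (BTT, MLR) discussed in the paper; the real
    constructions differ and this parameterization is hypothetical.
    """
    # Compute (q U)(V^T k^T) without ever materializing the D x D matrix M.
    return (q @ U) @ (V.T @ k.T) + (q * d) @ k.T

rng = np.random.default_rng(0)
T, dim, r = 8, 16, 4
q = rng.normal(size=(T, dim))
k = rng.normal(size=(T, dim))
U = rng.normal(size=(dim, r))
V = rng.normal(size=(dim, r))
d = rng.normal(size=dim)

logits = structured_score(q, k, U, V, d)
# Row-wise softmax over the structured logits.
attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
```

The point of the factorization is that the score matrix costs O(T·r) extra work per token rather than a dense D×D product, which is the kind of compute trade-off the quoted passage describes.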
Researcher Affiliation Academia 1New York University. Correspondence to: Yilun Kuang <EMAIL>, Andrew Gordon Wilson <EMAIL>.
Pseudocode No The paper describes mathematical formulations and derivations, for example, in Section 3.1 describing structured matrix families with definitions and properties, and in Section 3.4 for MLR Attention derivation. However, it does not include any explicitly labeled pseudocode or algorithm blocks. The methods are described through equations and textual explanations.
Open Source Code Yes Our code is available at the following github repository https://github.com/YilunKuang/structured-attention.
Open Datasets Yes We train 6-layer transformers with both standard attention and MLR attention on the Open Web Text dataset... (Section 5.1) ... The Electricity Transformer Temperature (ETT) dataset (Zhou et al., 2021) tracks fluctuations of oil temperature along with six additional power load features across time. (Section 5.2) ... Finally, we substitute the attention mechanism in the foundational model of Chronos (Ansari et al., 2024). (Appendix J)
Dataset Splits No For the Open Web Text dataset (Section 5.1), the paper mentions "sequence length T = 1024" but does not specify how the dataset is split into training, validation, or test sets. For in-context regression (Section 4), it mentions "N = 2d_input" for prompts and training causally on a loss, which describes the task setup rather than explicit train/test/validation splits. For the ETT dataset (Section 5.2 and Appendix J), it mentions "horizons of T ∈ {96, 192, 336} hours" but no explicit train/test/validation splits.
Hardware Specification No The paper does not provide specific details about the hardware used, such as GPU models, CPU types, or memory specifications. It mentions "compute budget" in various sections but without naming any specific hardware.
Software Dependencies No The paper mentions using "AdamW (Loshchilov & Hutter, 2019)" and tuning hyperparameters based on "µP (Yang et al., 2022)", as well as "Ray Tune (Liaw et al., 2018)" in Appendix J. While these are software tools/methods, specific version numbers for libraries, frameworks (like PyTorch or TensorFlow), or Python itself are not provided.
Experiment Setup Yes In Section 5.1, for language modeling, it states: "We train 6-layer transformers with both standard attention and MLR attention on the Open Web Text dataset with a batch size of 4, sequence length T = 1024, head dimension r = 64, and model width D ∈ {256, 384, 512, 768}. The model is trained with AdamW (Loshchilov & Hutter, 2019) and we tune hyperparameters based on µP (Yang et al., 2022)." In Section 4, for in-context regression: "We train 6-layer transformers with 8 heads and varying embedding and head dimensions..." Appendix J, for Time Series Forecasting: "We train a transformer model with 2 encoder layers, embedding dimension D = 512, and 8 attention heads on the ETTh1 subset... We use Ray Tune (Liaw et al., 2018) to sweep over optimal learning rates in the range of {0.00001, 0.0002, 0.005} and SGD iterations in {200, 1000}."
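The quoted setup can be collected into a single configuration sketch. The field names below are illustrative choices of my own, not identifiers from the authors' code; the values are those reported in the row above.

```python
# Hypothetical config mirroring the setup quoted from Section 5.1;
# field names are illustrative, values are as reported in the paper.
lm_config = {
    "dataset": "OpenWebText",
    "n_layers": 6,
    "batch_size": 4,
    "seq_len": 1024,                        # T
    "head_dim": 64,                         # r
    "model_widths": [256, 384, 512, 768],   # D sweep
    "optimizer": "AdamW",
    "hparam_tuning": "muP",
}

# Hypothetical config mirroring the Appendix J time-series setup.
ts_config = {
    "dataset": "ETTh1",
    "encoder_layers": 2,
    "embed_dim": 512,                       # D
    "n_heads": 8,
    "lr_sweep": [0.00001, 0.0002, 0.005],   # swept via Ray Tune
    "sgd_iterations": [200, 1000],
}
```

Laying the rows out this way makes it easy to see which knobs vary across the sweep (model width, learning rate, iteration count) and which are fixed.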