Geometry of Lightning Self-Attention: Identifiability and Dimension

Authors: Nathan Henry, Giovanni Luca Marchetti, Kathlén Kohn

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide empirical evidence for Conjecture 3.10. To this end, we implement a deep attention network with softmax normalization (i.e., S(x) = e^x), and estimate the dimension of its neuromanifold. The results are visualized in Figure 3 for a deep attention network with l = 2 layers, t = 3, a_i = 2 for all i, and d_i = δ varying from 3 to 10. The plot shows both the dimension estimated via the numerical approach ("Estimated") and the one computed via Equation 16 ("Expected"). The two values coincide for all δ, confirming Conjecture 3.10 empirically.
Researcher Affiliation | Academia | Nathan W. Henry (University of Toronto), Giovanni Luca Marchetti (KTH Royal Institute of Technology), Kathlén Kohn (KTH Royal Institute of Technology)
Pseudocode | No | The paper describes its mathematical proofs and derivations using equations and prose, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | Our Python code is available at a public repository: https://github.com/giovanni-marchetti/NeuroDim
Open Datasets | No | To this end, we implement a deep attention network with softmax normalization (i.e., S(x) = e^x), and estimate the dimension of its neuromanifold. The latter is a subtle problem since, differently from the lightning case, the neuromanifold is not a priori embedded in a finite-dimensional vector space. Therefore, we rely on a stochastic finite element approach by randomly sampling N = 250 input points in R^(d0×t) from a normal distribution and restricting φ_W to this finite space.
Dataset Splits | No | Therefore, we rely on a stochastic finite element approach by randomly sampling N = 250 input points in R^(d0×t) from a normal distribution and restricting φ_W to this finite space. This text describes the generation of synthetic data, not the splitting of an existing dataset into training, validation, or test sets.
Hardware Specification | No | The paper does not mention any specific hardware (e.g., CPU or GPU models, or cloud computing resources) used for running the numerical verifications.
Software Dependencies | No | The paper states that 'Our Python code is available at a public repository' but does not specify any software versions (e.g., Python version, or library versions such as PyTorch or TensorFlow) used in the implementation.
Experiment Setup | Yes | The results are visualized in Figure 3 for a deep attention network with l = 2 layers, t = 3, a_i = 2 for all i, and d_i = δ varying from 3 to 10.
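The dimension-estimation procedure quoted in the table (restrict φ_W to finitely many random inputs, then take the numerical rank of the Jacobian with respect to the parameters as the local dimension of the neuromanifold) can be sketched as follows. This is not the authors' released code; it is a minimal NumPy illustration assuming a single-layer, single-head softmax attention map, a finite-difference Jacobian, and a heuristic singular-value cutoff, all of which are choices made here for the sketch.

```python
import numpy as np

def softmax(z, axis=-1):
    # row-wise softmax, shifted for numerical stability
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    # single-head self-attention: softmax(X Wq (X Wk)^T) X Wv
    A = softmax(X @ Wq @ (X @ Wk).T, axis=-1)
    return A @ X @ Wv

def phi(w, Xs, d, a):
    # unpack a flat parameter vector into (Wq, Wk, Wv) and
    # evaluate the network on every sampled input, concatenated
    Wq = w[:d * a].reshape(d, a)
    Wk = w[d * a:2 * d * a].reshape(d, a)
    Wv = w[2 * d * a:].reshape(d, d)
    return np.concatenate([attention(X, Wq, Wk, Wv).ravel() for X in Xs])

def estimated_dimension(d=3, a=2, t=3, N=50, eps=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # random inputs in R^(d x t), playing the role of the N = 250
    # normal samples described in the report
    Xs = [rng.standard_normal((t, d)) for _ in range(N)]
    w0 = rng.standard_normal(2 * d * a + d * d)  # random parameter point
    f0 = phi(w0, Xs, d, a)
    # finite-difference Jacobian of the restricted map w -> phi_W(inputs)
    J = np.empty((f0.size, w0.size))
    for j in range(w0.size):
        w = w0.copy()
        w[j] += eps
        J[:, j] = (phi(w, Xs, d, a) - f0) / eps
    s = np.linalg.svd(J, compute_uv=False)
    # numerical rank of the Jacobian = local dimension of the image;
    # the relative cutoff is a heuristic for finite-difference noise
    return int((s > s[0] * 1e-3).sum())
```

With enough generic inputs, the rank stabilizes and no longer depends on N; the cutoff separating "true" from noise-level singular values is the delicate part, which an autodiff Jacobian (as in a PyTorch or JAX implementation) would make much cleaner.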