Geometry of Lightning Self-Attention: Identifiability and Dimension
Authors: Nathan Henry, Giovanni Luca Marchetti, Kathlén Kohn
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide empirical evidence for Conjecture 3.10. To this end, we implement a deep attention network with softmax normalization (i.e., S(x) = e^x) and estimate the dimension of its neuromanifold. The results are visualized in Figure 3 for a deep attention network with l = 2 layers, t = 3, a_i = 2 for all i, and d_i = δ varying from 3 to 10. The plot shows both the dimension estimated via the numerical approach ("Estimated") and the one computed via Equation 16 ("Expected"). The two values coincide for all δ, confirming Conjecture 3.10 empirically. |
| Researcher Affiliation | Academia | Nathan W. Henry (University of Toronto), Giovanni Luca Marchetti (Royal Institute of Technology, KTH), Kathlén Kohn (Royal Institute of Technology, KTH) |
| Pseudocode | No | The paper describes mathematical proofs and derivations using equations and prose, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Our Python code is available at a public repository: https://github.com/giovanni-marchetti/NeuroDim |
| Open Datasets | No | To this end, we implement a deep attention network with softmax normalization (i.e., S(x) = e^x), and estimate the dimension of its neuromanifold. The latter is a subtle problem since, differently from the lightning case, the neuromanifold is not a priori embedded in a finite-dimensional vector space. Therefore, we rely on a stochastic finite element approach by randomly sampling N = 250 input points in R^(d_0 × t) from a normal distribution and restricting φ_W to this finite space. |
| Dataset Splits | No | Therefore, we rely on a stochastic finite element approach by randomly sampling N = 250 input points in R^(d_0 × t) from a normal distribution and restricting φ_W to this finite space. This text describes the generation of synthetic data, not the splitting of an existing dataset into training, validation, or test sets. |
| Hardware Specification | No | The paper does not mention any specific hardware (e.g., CPU, GPU models, or cloud computing resources) used for running the numerical verifications. |
| Software Dependencies | No | The paper mentions 'Our Python code is available at a public repository' but does not specify any particular software versions (e.g., Python version, library versions like PyTorch, TensorFlow, etc.) used in the implementation. |
| Experiment Setup | Yes | The results are visualized in Figure 3 for a deep attention network with l = 2 layers, t = 3, ai = 2 for all i, and di = δ varying from 3 to 10. |
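The dimension-estimation procedure quoted above can be illustrated with a short sketch. This is not the authors' code: it assumes a single softmax self-attention layer with hypothetical weight matrices `Wq`, `Wk`, `Wv`, a smaller sample size than the paper's N = 250, and a simple relative singular-value cutoff for the numerical rank. The idea matches the quoted description: restrict the parametrization map φ_W to a finite random sample of inputs, then estimate the neuromanifold's dimension as the rank of its Jacobian at a generic weight.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small configuration: t tokens, input/output dim d, head dim a.
t, d, a = 3, 3, 2
N = 50  # number of random input points (the paper samples N = 250)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(X, Wq, Wk, Wv):
    """One softmax self-attention layer applied to X of shape (t, d)."""
    A = softmax((X @ Wq) @ (X @ Wk).T)  # (t, t) attention matrix
    return A @ (X @ Wv)                 # (t, d) output tokens

n_params = 2 * d * a + d * d            # entries of Wq, Wk, Wv

def phi(w, Xs):
    """Restriction of the network map to the finite input sample:
    unpack the flat weight vector and stack all outputs."""
    Wq = w[:d * a].reshape(d, a)
    Wk = w[d * a:2 * d * a].reshape(d, a)
    Wv = w[2 * d * a:].reshape(d, d)
    return np.concatenate([attention(X, Wq, Wk, Wv).ravel() for X in Xs])

Xs = [rng.standard_normal((t, d)) for _ in range(N)]
w0 = rng.standard_normal(n_params)      # generic point in weight space

# Central-difference Jacobian of phi at w0.
eps = 1e-5
cols = []
for i in range(n_params):
    e_i = np.zeros(n_params)
    e_i[i] = eps
    cols.append((phi(w0 + e_i, Xs) - phi(w0 - e_i, Xs)) / (2 * eps))
J = np.stack(cols, axis=1)              # shape (N * t * d, n_params)

# Estimated neuromanifold dimension = numerical rank of the Jacobian.
sv = np.linalg.svd(J, compute_uv=False)
dim = int((sv > 1e-4 * sv[0]).sum())
print(f"parameters: {n_params}, estimated neuromanifold dimension: {dim}")
```

For a single layer the estimated dimension should fall strictly below the raw parameter count, since the output depends on Wq and Wk only through the product Wq Wk^T (reparametrizing Wq → Wq G, Wk → Wk G^(-T) leaves the map unchanged); the gap is what a fiber-dimension formula like the paper's Equation 16 accounts for.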