Equivalent Linear Mappings of Large Language Models
Authors: James Robert Golden
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate equivalent linearity in model families including Qwen 3, Gemma 3 and Llama 3, at a range of sizes up to Qwen 3 14B. The detached Jacobian reconstructions match the predicted embedding, with relative error (the norm of the reconstruction error divided by the norm of the output embedding) below 10^-13 at double floating-point precision. See reconstructions for Llama 3.2 3B and Gemma 3 4B in Fig. A2. The paper includes multiple figures and tables presenting empirical results, comparisons, and analysis derived from these LLMs, such as singular value decomposition of the detached Jacobian. |
| Researcher Affiliation | Industry | James R. Golden EMAIL Oakland, CA. The author's email address uses a .com domain (gmail.com), which, according to the provided rules, indicates an industry affiliation. |
| Pseudocode | No | The paper describes the methodology using mathematical equations and textual explanations (e.g., equations 1-16 and the surrounding text), and presents schematic diagrams (e.g., Figure 1A), but it does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/jamesgolden1/equivalent-linear-LLMs/. |
| Open Datasets | No | The paper mentions evaluating models like Qwen 3, Gemma 3, and Llama 3 using '100 short input phrases' and specific input sequences (e.g., 'The bridge out of Marin is the') for analysis. However, it does not provide concrete access information (link, DOI, repository, or citation) for these specific input phrases or any other datasets used for evaluation. |
| Dataset Splits | No | The paper focuses on analyzing existing Large Language Models and their predictions rather than training new models from scratch. Consequently, it does not describe traditional training, validation, or test dataset splits. While it mentions using '100 short input phrases' for analysis, no specific split information is provided for these phrases. |
| Hardware Specification | Yes | The numerical computation of the full detached Jacobian matrix takes on the order of 10 seconds for an input sequence of 8 tokens for Llama 3.2 3B in float32 on a GPU with 24 GB VRAM. In contrast, the full Jacobian matrix for the same sequence at float64 precision with Qwen 3 14B on a GPU with 40 GB VRAM takes 20 seconds. The maximum length tested on a GPU with 80 GB VRAM was over 400 tokens... |
| Software Dependencies | No | The paper mentions using 'PyTorch' and 'JAX' for implementation and analysis (e.g., 'In PyTorch, this is accomplished by cloning and detaching the x tensor...' and 'Lanczos iteration has also been implemented in JAX for Gemma 3 4B'), but it does not specify any version numbers for these software packages or other dependencies. |
| Experiment Setup | Yes | We exploit a property of transformer decoders wherein every operation... can be expressed as A(x) x, where A(x) represents an input-dependent linear transform and x preserves the linear pathway. The method reconstructs the predicted output embedding with relative error below 10^-13 at double floating-point precision, requiring no additional model training. For the steering operator, the input sequence's embedding vectors x_new are multiplied by the detached Jacobian previously computed from the steering concept, J⁺_L(x_steer), scaled by λ, and added to the layer activation f_Li from the new input. |
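The A(x) x decomposition quoted above can be illustrated on a single nonlinearity. This is a minimal NumPy sketch, not the paper's PyTorch implementation: it assumes the SiLU activation used in these model families, and shows that freezing ("detaching") the input-dependent factor sigmoid(x) turns silu(x) into an exact linear map A(x) x at that input, with reconstruction error well below the paper's 10^-13 threshold.

```python
import numpy as np

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x = rng.standard_normal(8)  # hypothetical embedding vector, float64

# Detached linear factor: A(x) = diag(sigmoid(x)), treated as a constant
# matrix evaluated at the current input.
A = np.diag(1.0 / (1.0 + np.exp(-x)))

reconstruction = A @ x  # the equivalent linear mapping applied to x

rel_error = np.linalg.norm(reconstruction - silu(x)) / np.linalg.norm(silu(x))
print(rel_error)  # well below 1e-13 at double precision
```

The paper composes such input-dependent linear factors across every operation in the decoder (attention, normalization, gated MLPs) to obtain a full detached Jacobian; this sketch only demonstrates the principle for one elementwise nonlinearity.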
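The steering operator described in the Experiment Setup row can likewise be sketched. This is a hedged illustration with hypothetical shapes and names: J_steer stands in for the precomputed detached Jacobian J⁺_L(x_steer) (here just a random matrix), x_new for the new input's embedding vectors, f_L for the layer activation from the new input, and lam for the scaling factor λ.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16  # hypothetical embedding dimension

J_steer = rng.standard_normal((d, d))  # placeholder for detached Jacobian J+_L(x_steer)
x_new = rng.standard_normal(d)         # embedding of the new input sequence
f_L = rng.standard_normal(d)           # layer activation for the new input
lam = 0.5                              # steering strength λ

# Steering update: apply the steering concept's detached Jacobian to the
# new embedding, scale by λ, and add to the layer activation.
f_steered = f_L + lam * (J_steer @ x_new)
```

In the paper the Jacobian comes from the steering phrase's forward pass rather than random data; the point of the sketch is only the form of the update, activation plus a scaled linear transform of the new input.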