Equivalent Linear Mappings of Large Language Models
Authors: James Robert Golden
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate equivalent linearity in model families including Qwen 3, Gemma 3 and Llama 3, at a range of sizes up to Qwen 3 14B. The detached Jacobian reconstructions match the predicted embedding, with relative error (the norm of the reconstruction error divided by the norm of the output embedding) below 10^-13 at double floating-point precision. See reconstructions for Llama 3.2 3B and Gemma 3 4B in Fig. A2. The paper includes multiple figures and tables presenting empirical results, comparisons, and analysis derived from these LLMs, such as singular value decomposition of the detached Jacobian. |
| Researcher Affiliation | Industry | James R. Golden EMAIL Oakland, CA. The author's email address uses a .com domain (gmail.com), which, according to the provided rules, indicates an industry affiliation. |
| Pseudocode | No | The paper describes the methodology using mathematical equations and textual explanations (e.g., equations 1-16 and the surrounding text), and presents schematic diagrams (e.g., Figure 1A), but it does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/jamesgolden1/equivalent-linear-LLMs/. |
| Open Datasets | No | The paper mentions evaluating models like Qwen 3, Gemma 3, and Llama 3 using '100 short input phrases' and specific input sequences (e.g., 'The bridge out of Marin is the') for analysis. However, it does not provide concrete access information (link, DOI, repository, or citation) for these specific input phrases or any other datasets used for evaluation. |
| Dataset Splits | No | The paper focuses on analyzing existing Large Language Models and their predictions rather than training new models from scratch. Consequently, it does not describe traditional training, validation, or test dataset splits. While it mentions using '100 short input phrases' for analysis, no specific split information is provided for these phrases. |
| Hardware Specification | Yes | The numerical computation of the full detached Jacobian matrix takes on the order of 10 seconds for an input sequence of 8 tokens for Llama 3.2 3B in float32 on a GPU with 24 GB VRAM. In contrast, the full Jacobian matrix for the same sequence at float64 precision with Qwen 3 14B on a GPU with 40 GB VRAM takes 20 seconds. The maximum length tested on a GPU with 80 GB VRAM was over 400 tokens... |
| Software Dependencies | No | The paper mentions using 'PyTorch' and 'JAX' for implementation and analysis (e.g., 'In PyTorch, this is accomplished by cloning and detaching the x tensor...' and 'Lanczos iteration has also been implemented in JAX for Gemma 3 4B'), but it does not specify any version numbers for these software packages or other dependencies. |
| Experiment Setup | Yes | We exploit a property of transformer decoders wherein every operation... can be expressed as A(x) x, where A(x) represents an input-dependent linear transform and x preserves the linear pathway. The method reconstructs the predicted output embedding with relative error below 10^-13 at double floating-point precision, requiring no additional model training. For the steering operator, the input sequence's embedding vectors x_new are multiplied by the detached Jacobian previously computed from the steering concept, J⁺_L(x_steer), scaled by λ, and added to the layer activation f_Li from the new input. |
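The A(x) x decomposition quoted above can be illustrated on a single nonlinearity. This is a minimal NumPy sketch, not the paper's PyTorch implementation: it assumes the SiLU activation used in these model families, and shows that freezing ("detaching") the input-dependent factor sigmoid(x) turns silu(x) into an exact linear map A(x) x at that input, with reconstruction error well below the paper's 10^-13 threshold.

```python
import numpy as np

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x = rng.standard_normal(8)  # hypothetical embedding vector, float64

# Detached linear factor: A(x) = diag(sigmoid(x)), treated as a constant
# matrix evaluated at the current input.
A = np.diag(1.0 / (1.0 + np.exp(-x)))

reconstruction = A @ x  # the equivalent linear mapping applied to x

rel_error = np.linalg.norm(reconstruction - silu(x)) / np.linalg.norm(silu(x))
print(rel_error)  # well below 1e-13 at double precision
```

The paper composes such input-dependent linear factors across every operation in the decoder (attention, normalization, gated MLPs) to obtain a full detached Jacobian; this sketch only demonstrates the principle for one elementwise nonlinearity.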
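The steering operator described in the Experiment Setup row can likewise be sketched. This is a hedged illustration with hypothetical shapes and names: J_steer stands in for the precomputed detached Jacobian J⁺_L(x_steer) (here just a random matrix), x_new for the new input's embedding vectors, f_L for the layer activation from the new input, and lam for the scaling factor λ.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16  # hypothetical embedding dimension

J_steer = rng.standard_normal((d, d))  # placeholder for detached Jacobian J+_L(x_steer)
x_new = rng.standard_normal(d)         # embedding of the new input sequence
f_L = rng.standard_normal(d)           # layer activation for the new input
lam = 0.5                              # steering strength λ

# Steering update: apply the steering concept's detached Jacobian to the
# new embedding, scale by λ, and add to the layer activation.
f_steered = f_L + lam * (J_steer @ x_new)
```

In the paper the Jacobian comes from the steering phrase's forward pass rather than random data; the point of the sketch is only the form of the update, activation plus a scaled linear transform of the new input.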