Bilinear MLPs enable weight-based mechanistic interpretability

Authors: Michael Pearce, Thomas Dooms, Alice Rigg, Jose Oramas, Lee Sharkey

ICLR 2025

Reproducibility Variable | Result | LLM Response

Research Type | Experimental | "Analyzing the spectra of bilinear MLP weights using eigendecomposition reveals interpretable low-rank structure across toy tasks, image classification, and language modeling."

Researcher Affiliation | Collaboration | Michael T. Pearce (Independent, pearcemt@alumni.stanford.edu); Thomas Dooms (University of Antwerp, thomas.dooms@uantwerpen.be); Alice Rigg (Independent, rigg.alice0@gmail.com); Jose Oramas (University of Antwerp, sqIRL/IDLab, EMAIL); Lee Sharkey (Apollo Research, EMAIL)

Pseudocode | No | The paper describes its methods and procedures in narrative text, without any structured pseudocode or algorithm blocks.

Open Source Code | Yes | Code at: https://github.com/tdooms/bilinear-decomposition

Open Datasets | Yes | "We consider models trained on the MNIST dataset of handwritten digits and the Fashion-MNIST dataset of clothing images. ... (Eldan & Li, 2023) (see training details in Appendix G). ... The models used in the experiments shown in Figure 9 are trained on the FineWeb dataset (Penedo et al., 2024)."

Dataset Splits | No | The paper uses well-known datasets such as MNIST and Fashion-MNIST but does not explicitly detail the train/validation/test splits used in its experiments. For the language models, it mentions "context length 256" and "context length 512" but gives no split percentages or example counts.

Hardware Specification | Yes | "We thank CoreWeave for providing compute for the finetuning experiments. ... we fine-tuned TinyLlama-1.1B ... using a single A40 GPU."

Software Dependencies | No | The paper details experimental setups and hyperparameters in Appendix G but does not provide version numbers for software dependencies such as libraries, frameworks, or programming languages.

Experiment Setup | Yes | "This section contains details about our architectures used and hyperparameters to help reproduce results. More information can be found in our code [currently not referenced for anonymity]." (Followed by Tables 1, 2, 3, and 4 listing specific hyperparameters such as learning rate, batch size, optimizer, and epochs.)
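The Research Type row above quotes the paper's core method: eigendecomposing bilinear MLP weights to expose low-rank structure. A minimal NumPy sketch of that idea follows, assuming a bilinear layer of the form g(x) = (Wx) ⊙ (Vx); all names and dimensions here are illustrative and not taken from the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 16, 32

# A bilinear MLP layer computes g(x) = (W x) * (V x), elementwise.
W = rng.normal(size=(d_hidden, d_in))
V = rng.normal(size=(d_hidden, d_in))

# A readout direction u over the hidden units (e.g., one logit's weights).
u = rng.normal(size=d_hidden)

# Along that direction the layer is a quadratic form x^T Q x with
# Q_ij = sum_h u_h W_hi V_hj; symmetrize, since x^T Q x = x^T Q^T x.
Q = np.einsum("h,hi,hj->ij", u, W, V)
Q_sym = 0.5 * (Q + Q.T)

# Eigendecomposition of the symmetric interaction matrix: eigenvectors
# are input directions, eigenvalues their signed contribution strengths,
# so a few large-magnitude eigenvalues indicate low-rank structure.
eigvals, eigvecs = np.linalg.eigh(Q_sym)

# Sanity check: the spectrum reproduces the layer's output along u.
x = rng.normal(size=d_in)
direct = u @ ((W @ x) * (V @ x))
via_spectrum = sum(lam * (v @ x) ** 2 for lam, v in zip(eigvals, eigvecs.T))
assert np.isclose(direct, via_spectrum)
```

Because the layer has no elementwise nonlinearity between the two linear maps, this quadratic-form rewrite is exact, which is what makes the weight-based spectral analysis possible in the first place.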