HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks
Authors: Jiuding Sun, Jing Huang, Sidharth Baskaran, Karel D'Oosterlinck, Christopher Potts, Michael Sklar, Atticus Geiger
ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments with Llama3-8B, HyperDAS achieves state-of-the-art performance on the RAVEL benchmark for disentangling concepts in hidden states. |
| Researcher Affiliation | Collaboration | Pr(Ai)2R Group, Stanford University, Confirm Labs, Ghent University |
| Pseudocode | No | The paper describes methods using numbered steps and equations (e.g., Section 3.1 to 3.4), but no explicitly labeled 'Pseudocode' or 'Algorithm' block is present. Figure 1 is a diagram, not pseudocode. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing its code, nor does it provide a link to a code repository for the HyperDAS methodology. |
| Open Datasets | Yes | We benchmark Hyper DAS on the RAVEL interpretability benchmark (Huang et al., 2024), in which concepts related to a type of entity are disentangled. The RAVEL benchmark evaluates how well an interpretability method can localize and disentangle entity attributes through causal interventions. |
| Dataset Splits | Yes | Table 1 gives the details of the dataset used for the experiments, in the format of train/test splits. For every model in each setting, methods are trained on the full dataset of that setting for 5 epochs. The prompts used in the train and test splits are completely disjoint. Example: City 34899/7016, 49500/9930, 3552/3374 (train/test splits for # of Cause Examples, # of Isolate Examples, and # of Entities, respectively). |
| Hardware Specification | No | The paper mentions 'Llama3-8B' and 'Our target Llama model requires 16GB of RAM' but does not specify the type of GPU, CPU, or other hardware used for running the experiments or training the models. |
| Software Dependencies | No | The paper mentions 'Llama3-8B (Meta, 2024)' as the target model but does not specify any software versions for libraries, frameworks, or programming languages used (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | Crucial hyperparameters: We use 8 decoder blocks for the hypernetwork and 32 attention heads for computing the pairwise token position attention. The sparsity loss weight is scheduled to increase linearly from 0 to 1.5, starting at 50% of the total steps. A learning rate between 2e-4 and 2e-5 is chosen depending on the dataset. Discussion of these choices concerning the sparsity loss is in Section 4.2. For the feature subspace, we experiment with dimensions from 32 up to 2048 (out of 4096 dimensions) and use a subspace of dimension 128. |
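The sparsity-weight schedule quoted above (linear increase from 0 to 1.5, starting at 50% of total steps) can be sketched as follows. This is a minimal illustration under assumptions, not the paper's actual implementation; the function name and parameters are hypothetical.

```python
def sparsity_weight(step: int, total_steps: int,
                    max_weight: float = 1.5,
                    start_frac: float = 0.5) -> float:
    """Hypothetical sketch of the schedule described in the paper:
    the sparsity loss weight stays at 0 for the first half of
    training, then increases linearly to max_weight by the end."""
    start = int(total_steps * start_frac)
    if step < start:
        return 0.0
    # Fraction of the ramp completed, clamped to [0, 1].
    progress = (step - start) / max(total_steps - start, 1)
    return min(progress, 1.0) * max_weight
```

For example, with `total_steps=100` the weight is 0.0 at step 50, 0.75 at step 75, and reaches 1.5 at step 100.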