HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks
Authors: Jiuding Sun, Jing Huang, Sidharth Baskaran, Karel D'Oosterlinck, Christopher Potts, Michael Sklar, Atticus Geiger
ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments with Llama3-8B, HyperDAS achieves state-of-the-art performance on the RAVEL benchmark for disentangling concepts in hidden states. |
| Researcher Affiliation | Collaboration | Pr(Ai)2R Group, Stanford University, Confirm Labs, Ghent University |
| Pseudocode | No | The paper describes methods using numbered steps and equations (e.g., Section 3.1 to 3.4), but no explicitly labeled 'Pseudocode' or 'Algorithm' block is present. Figure 1 is a diagram, not pseudocode. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing its code, nor does it provide a link to a code repository for the HyperDAS methodology. |
| Open Datasets | Yes | We benchmark Hyper DAS on the RAVEL interpretability benchmark (Huang et al., 2024), in which concepts related to a type of entity are disentangled. The RAVEL benchmark evaluates how well an interpretability method can localize and disentangle entity attributes through causal interventions. |
| Dataset Splits | Yes | Table 1 gives the details of the dataset used for the experiments, in the format of train/test splits. For every model in each setting, methods are trained on the full dataset of that setting for 5 epochs. The prompts used in the train and test splits are completely disjoint. Example: City 34899/7016, 49500/9930, 3552/3374 (train/test splits for # of Cause Examples, # of Isolate Examples, and # of Entities, respectively). |
| Hardware Specification | No | The paper mentions 'Llama3-8B' and 'Our target Llama model requires 16GB of RAM' but does not specify the type of GPU, CPU, or other hardware used for running the experiments or training the models. |
| Software Dependencies | No | The paper mentions 'Llama3-8B (Meta, 2024)' as the target model but does not specify any software versions for libraries, frameworks, or programming languages used (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | Crucial hyperparameters: We use 8 decoder blocks for the hypernetwork and 32 attention heads for computing the pairwise token position attention. The sparsity loss weight is scheduled to increase linearly from 0 to 1.5, starting at 50% of the total steps. A learning rate between 2e-4 and 2e-5 is chosen depending on the dataset. Discussion of these choices concerning the sparsity loss is in Section 4.2. For the feature subspace, we experiment with dimensions from 32 up to 2048 (out of 4096 dimensions) and use a subspace of dimension 128. |
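The sparsity-weight schedule quoted above (linear increase from 0 to 1.5, starting at 50% of total steps) can be sketched as follows. This is a minimal illustration under assumptions, not the paper's actual implementation; the function name and parameters are hypothetical.

```python
def sparsity_weight(step: int, total_steps: int,
                    max_weight: float = 1.5,
                    start_frac: float = 0.5) -> float:
    """Hypothetical sketch of the schedule described in the paper:
    the sparsity loss weight stays at 0 for the first half of
    training, then increases linearly to max_weight by the end."""
    start = int(total_steps * start_frac)
    if step < start:
        return 0.0
    # Fraction of the ramp completed, clamped to [0, 1].
    progress = (step - start) / max(total_steps - start, 1)
    return min(progress, 1.0) * max_weight
```

For example, with `total_steps=100` the weight is 0.0 at step 50, 0.75 at step 75, and reaches 1.5 at step 100.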