Universal Neurons in GPT2 Language Models
Authors: Wes Gurnee, Theo Horsley, Zifan Carl Guo, Tara Rezaei Kheirkhah, Qinyi Sun, Will Hathaway, Neel Nanda, Dimitris Bertsimas
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we study the universality of individual neurons across GPT2 models trained from different initial random seeds, motivated by the hypothesis that universal neurons are likely to be interpretable. In particular, we compute pairwise correlations of neuron activations over 100 million tokens for every neuron pair across five different seeds and find that 1-5% of neurons are universal, that is, pairs of neurons which consistently activate on the same inputs. We then study these universal neurons in detail, finding that they usually have clear interpretations, and taxonomize them into a small number of neuron families. Figure 2 summarizes our results. In Figure 2a, we depict the average of the maximum neuron correlations across models b-e, the average of the baseline correlations, and the excess correlation, i.e., the left term, the right term, and the difference in Equation (3). |
| Researcher Affiliation | Academia | Wes Gurnee EMAIL, Massachusetts Institute of Technology; Theo Horsley EMAIL, University of Cambridge; Zifan Carl Guo EMAIL, Massachusetts Institute of Technology; Tara Rezaei Kheirkhah EMAIL, Massachusetts Institute of Technology; Qinyi Sun EMAIL, Massachusetts Institute of Technology; Will Hathaway EMAIL, Massachusetts Institute of Technology; Neel Nanda EMAIL; Dimitris Bertsimas EMAIL, Massachusetts Institute of Technology |
| Pseudocode | No | The paper describes its methods in paragraph text and mathematical notation, such as MLP(x) = W_out σ(W_in x + b_in) + b_out (Eq. 1) and the activation correlation ρ^{a,m}_{i,j} = E[(v_i − µ_i)(v_j − µ_j)] / (σ_i σ_j). There are no explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | All of our code and data is available at https://github.com/wesg52/universal-neurons. |
| Open Datasets | Yes | We compute pairwise correlations of neuron activations over 100 million tokens for every neuron pair across the different seeds and find that only 1-5% of neurons pass a target threshold of universality compared to random baselines (§4.1). We compute our correlations over a 100 million token subset of the Pile test set (Gao et al., 2020), tokenized to a context length of 512 tokens. |
| Dataset Splits | No | The paper analyzes pre-trained GPT2 models and evaluates them on a subset of the Pile test set. While it mentions '100 million tokens from the Pile test set', it does not specify training, validation, or test splits for the models it investigates, which are already trained. The context describes evaluation data rather than data splits for model training. |
| Hardware Specification | No | The paper mentions studying 'GPT2-small and GPT2-medium architecture' and 'Pythia-160m' models, but does not provide specific hardware details such as GPU or CPU models used for their experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python, PyTorch/TensorFlow, CUDA versions). It only mentions general activation functions like 'gelu_new' in the model hyperparameters table. |
| Experiment Setup | Yes | Table 1 (Section B.3) lists the model hyperparameters. For the GPT2-small / GPT2-medium / Pythia-160m architectures respectively: layers 12/24/12; heads 12/16/12; d_model 768/1024/768; d_vocab 50257/50257/50304; d_MLP 3072/4096/3072; parameters 160M/410M/160M; context 1024/1024/2048; activation function gelu_new/gelu_new/gelu; positional embeddings absolute/absolute/RoPE; precision Float-32/Float-32/Float-16; dataset OpenWebText/OpenWebText/Pile; p_dropout 0.1/0.1/0. |
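The core computation the paper relies on, pairwise Pearson correlations between neuron activations from two models run on the same token stream, can be sketched as below. This is a minimal illustrative version, not the authors' released code (see their repository for the actual implementation); the function name and the synthetic activation matrices are assumptions for the example.

```python
import numpy as np

def pairwise_neuron_correlations(acts_a, acts_b):
    """Pearson correlation between every neuron in model A and every
    neuron in model B, given activations over the same tokens.

    acts_a: (n_tokens, n_neurons_a) activation matrix for model A
    acts_b: (n_tokens, n_neurons_b) activation matrix for model B
    Returns an (n_neurons_a, n_neurons_b) correlation matrix.
    """
    # Center each neuron's activations, then normalize columns to
    # unit length; the dot product of unit columns is their Pearson r.
    a = acts_a - acts_a.mean(axis=0)
    b = acts_b - acts_b.mean(axis=0)
    a /= np.linalg.norm(a, axis=0)
    b /= np.linalg.norm(b, axis=0)
    return a.T @ b

# Toy demo: one shared latent "feature" drives neuron 0 of model A
# and neuron 2 of model B, so that pair should correlate strongly.
rng = np.random.default_rng(0)
shared = rng.normal(size=(1000, 1))
acts_a = np.hstack([shared + 0.1 * rng.normal(size=(1000, 1)),
                    rng.normal(size=(1000, 3))])
acts_b = np.hstack([rng.normal(size=(1000, 2)),
                    shared + 0.1 * rng.normal(size=(1000, 1))])

corr = pairwise_neuron_correlations(acts_a, acts_b)
print(corr.shape)            # one row per model-A neuron
print(corr[0].argmax())      # model-B neuron best matching A's neuron 0
```

In the paper's setting this is computed over 100 million tokens for every neuron pair across five seeds, so the practical implementation must stream activations in batches rather than materializing the full token-by-neuron matrices; the "excess correlation" then compares each neuron's maximum correlation against a rotated-basis baseline.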