Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective
Authors: Zubair Bashir, Bhavik Chandna, Procheta Sen
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | For demographic and gender bias-related experiments, we used two different types of datasets. The Demographic Bias dataset used in our experiment is from an existing study in Narayanan Venkit et al. (2023). [...] Table 3 shows the performance on two NLP tasks: CoNLL-2003 (Sang & De Meulder, 2003), a named entity recognition benchmark, and CoLA (Warstadt, 2019), a linguistic acceptability judgment task, for all the models. |
| Researcher Affiliation | Academia | Zubair Bashir EMAIL Indian Institute of Technology, Kharagpur; Bhavik Chandna EMAIL University of California San Diego; Procheta Sen EMAIL University of Liverpool |
| Pseudocode | No | The paper describes methodologies and mathematical formulations but does not contain any clearly labeled pseudocode or algorithm blocks. Descriptions are provided in prose. |
| Open Source Code | Yes | Our code is available at https://github.com/zubair2004/MI_Bias. |
| Open Datasets | Yes | The Demographic Bias dataset used in our experiment is from an existing study in Narayanan Venkit et al. (2023). [...] To understand Gender Bias in models, we used the set of 320 professions chosen and annotated from Bolukbasi et al. (2016b). |
| Dataset Splits | Yes | Each model was fine-tuned for 20 epochs on the respective datasets, with early stopping based on validation loss. Evaluation was conducted on held-out validation splits, and circuit changes were analyzed using attention weight inspection and edge attribution methods within Transformer Lens. |
| Hardware Specification | Yes | We conducted all the experiments in a computing machine having two A100 GPUs. |
| Software Dependencies | No | The paper mentions the "Hooked-Transformer from Transformer Lens repository" and the "DistilBERT-base-uncased model" but does not provide specific version numbers for these or any other key software dependencies. |
| Experiment Setup | Yes | Fine-tuning was performed with the AdamW optimizer using a learning rate of 10⁻⁴ and a batch size of 129. Each model was fine-tuned for 20 epochs on the respective datasets, with early stopping based on validation loss. |
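The early-stopping criterion quoted above (stop fine-tuning when validation loss stops improving) can be sketched as follows. This is a minimal illustration, not the authors' code; the `patience` and `min_delta` values are assumptions, since the paper only states that early stopping was based on validation loss:

```python
class EarlyStopper:
    """Stop fine-tuning when validation loss stops improving."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # epochs to wait after the last improvement
        self.min_delta = min_delta    # minimum decrease that counts as improvement
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Usage: inside a fine-tuning loop capped at 20 epochs, as in the paper.
stopper = EarlyStopper(patience=3)
val_losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.68]  # illustrative values only
for epoch, loss in enumerate(val_losses):
    if stopper.step(loss):
        print(f"Stopped early at epoch {epoch}")
        break
```

The stopper tracks the best validation loss seen so far and halts once `patience` consecutive epochs fail to improve on it, which is the standard formulation of validation-loss early stopping.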