Extracting Interpretable Task-Specific Circuits from Large Language Models for Faster Inference
Authors: Jorge García-Carrasco, Alejandro Maté, Juan Trujillo
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach on different tasks and show that the resulting models are (i) considerably smaller, reducing the number of parameters by up to 82.77% and (ii) more interpretable, as they focus on the circuit that is used to carry out the specific task, and can therefore be understood using MI techniques. |
| Researcher Affiliation | Academia | Jorge García-Carrasco, Alejandro Maté, Juan Trujillo. Department of Software and Computing Systems, University of Alicante, Spain. EMAIL, EMAIL |
| Pseudocode | Yes | The pseudocode of our approach is presented in Algorithm 1. Essentially, given an LLM fθ and a dataset that elicits the specific task of interest (split into a patching dataset Da and a validation dataset Dv), our method automatically obtains a pruned model gθ that is able to perform such task. This process is controlled by several hyperparameters, namely the threshold α, the type of ablation used (either zero or mean ablation), and whether or not to prune MLPs. Algorithm 1: Automatic Task-Specific Circuit Extraction. Data: model fθ, patching dataset Da, validation dataset Dv, evaluation threshold α, ablation scheme, include_mlps. Result: pruned model gθ. gθ ← fθ; for layer ∈ [num_layers(fθ), ..., 0] do ... |
| Open Source Code | Yes | The code and data required to reproduce the experiments and figures, as well as the supplementary materials, can be found in https://github.com/jgcarrasco/circuit-extraction |
| Open Datasets | No | Given a dataset that elicits the specific task of interest (which is split into a patching dataset and a validation dataset, Da and Dv), and "Refer to Appendix A for a further discussion on the nature and curation of this dataset." The main paper does not provide concrete access information for the specific dataset used. |
| Dataset Splits | No | Given a dataset that elicits the specific task of interest (which is split into a patching dataset and a validation dataset, Da and Dv). While splits are mentioned, no specific ratios, sample counts, or detailed splitting methodology are provided to reproduce the data partitioning. |
| Hardware Specification | Yes | The experiments were performed on a RTX4090 GPU, on an estimated total of 72 hours of compute. |
| Software Dependencies | No | Our method is implemented on PyTorch (Paszke et al. 2019) by using the TransformerLens (Nanda and Bloom 2022) and Hugging Face Transformers (Wolf et al. 2020) libraries. This lists software components but does not provide specific version numbers for reproducibility. |
| Experiment Setup | Yes | This process will be controlled by several hyperparameters, namely the threshold α, the type of ablation used (either zero or mean ablation) and whether or not to prune MLPs. The thresholds were selected according to the results of the previous section, and mean ablation is used across all runs. For the baseline comparison, 'The model is trained by minimizing Ldistill for a total of 20000 epochs with the Adam optimizer (Kingma 2014) and a learning rate of 10⁻³.' |
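The greedy loop described in the Pseudocode row (walk the layers from last to first, ablate each component, and keep the ablation whenever validation performance stays above the threshold α) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `extract_circuit`, `model_eval`, and the component identifiers are all hypothetical names introduced here.

```python
# Hypothetical sketch of the Algorithm 1 loop: greedily ablate components
# from the last layer to the first, keeping an ablation only if the pruned
# model's validation score stays within a factor alpha of the baseline.

def extract_circuit(model_eval, components, alpha, baseline_score):
    """
    model_eval(pruned): validation score of the model with the components
        in `pruned` ablated (assumed callable, provided by the caller).
    components: component ids ordered from the last layer to the first.
    Returns the list of components that were safely ablated; the remaining
    components form the task-specific circuit.
    """
    pruned = []
    for comp in components:  # last layer -> first layer
        candidate = pruned + [comp]
        if model_eval(candidate) >= alpha * baseline_score:
            pruned = candidate  # ablating this component preserved the task
    return pruned
```

With α close to 1 the loop only removes components whose ablation barely affects the task, which is what makes the surviving circuit both small and interpretable.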
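The two ablation schemes named as hyperparameters (zero vs. mean ablation) differ only in what replaces a component's activations: zeros, or the component's mean activation over the patching dataset Da. A small NumPy sketch, with assumed array shapes and function names (not taken from the paper's code):

```python
import numpy as np

# Hypothetical illustration of zero vs. mean ablation. `activations` is the
# component's output for the current batch, shape (batch, d); the mean for
# mean ablation is taken over the patching dataset's activations.

def ablate(activations, patching_activations, scheme="mean"):
    if scheme == "zero":
        # Zero ablation: silence the component entirely.
        return np.zeros_like(activations)
    if scheme == "mean":
        # Mean ablation: replace every row with the mean activation
        # computed over the patching dataset.
        mean_act = patching_activations.mean(axis=0, keepdims=True)
        return np.broadcast_to(mean_act, activations.shape).copy()
    raise ValueError(f"unknown ablation scheme: {scheme}")
```

Mean ablation is generally considered gentler than zero ablation because it keeps the component's output in-distribution, which is consistent with the review's note that mean ablation is used across all runs.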