Interpreting Neurons in Deep Vision Networks with Language Models
Authors: Nicholas Bai, Rahul Ajay Iyer, Tuomas Oikarinen, Akshay R. Kulkarni, Tsui-Wei Weng
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We have conducted extensive qualitative and quantitative analysis to show that DnD outperforms prior work by providing higher quality neuron descriptions. |
| Researcher Affiliation | Academia | Nicholas Bai (EMAIL), UC San Diego; Rahul A. Iyer (EMAIL), UT Austin; Tuomas Oikarinen (EMAIL), UC San Diego; Akshay Kulkarni (EMAIL), UC San Diego; Tsui-Wei Weng (EMAIL), UC San Diego |
| Pseudocode | Yes | An overview of Describe-and-Dissect (DnD) and these 3 steps are illustrated in Figure 2. ... The algorithm consists of 4 substeps. |
| Open Source Code | Yes | Our code and data are available at https://github.com/Trustworthy-ML-Lab/Describe-and-Dissect. |
| Open Datasets | Yes | ResNet-50 and ResNet-18 (He et al., 2016) trained on ImageNet (Russakovsky et al., 2015) and Places365 (Zhou et al., 2016) respectively. ... We dissected both a ResNet-50 network pretrained on ImageNet-1K and a ResNet-18 trained on Places365, using the union of the ImageNet validation dataset and Broden (Bau et al., 2017) as our probing dataset. ... Tile2Vec (Jean et al., 2019) utilizes a modified ResNet-18 backbone trained to minimize triplet loss between anchor, neighbor, and distant land tiles from the NAIP dataset (Claire Boryan & Craig, 2011). ... We also evaluate a ResNet-50 model trained on labeled EuroSAT images (Helber et al., 2019) with 10 land cover classes. |
| Dataset Splits | Yes | We use the union of the ImageNet validation dataset and Broden as Dprobe and compare to Network Dissection (Bau et al., 2017), MILAN (Hernandez et al., 2022), and CLIP-Dissect (Oikarinen & Weng, 2023) as baselines. ... To compare the performance, following Oikarinen & Weng (2023), we use our model to describe the final-layer neurons of ResNet-50 (where we know their ground-truth role) and compare description similarity to the class name that neuron is detecting, as discussed in Section 4.2. |
| Hardware Specification | Yes | One limitation of Describe-and-Dissect is the relatively high computational cost, taking on average about 38.8 seconds per neuron with a Tesla V100 GPU. |
| Software Dependencies | No | The first model is Bootstrapping Language-Image Pretraining (BLIP) (Li et al., 2022), which is an image-to-text model... The second model is GPT-3.5 Turbo, which is a transformer model developed by OpenAI... The third model is Stable Diffusion (Rombach et al., 2022)... |
| Experiment Setup | Yes | The top K most highly activating images for a neuron n are collected in set I, \|I\| = K, by selecting the K images xᵢ ∈ Dprobe ∪ Dcropped with the largest g(Aₖ(xᵢ)). ... For the purposes of our experiments, we generate N = 5 candidate concepts unless otherwise mentioned. ... For the purposes of the experiments in this paper, we set Q = 10. ... For our experiments, we use t = 10. In practice, Rj is computed as the square of the ranks over the top β = 5 ranking images for better differentiation between scores, Rj = {(Rⱼⁱ)² : i ≤ β}. ... For both models we evaluated 4 of the intermediate layers (end of each residual block), with 200 randomly chosen neurons per layer for ResNet-50 and 50 per layer for ResNet-18. |
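The experiment-setup quotes describe two concrete computations: selecting the top-K most highly activating probe images for a neuron, and aggregating a candidate concept's per-image ranks as squared ranks over the top β = 5 scoring images. The snippet below is a minimal illustrative sketch of those two steps only, not the authors' released implementation; the function names, the use of a plain activation vector in place of g(Aₖ(xᵢ)), and the assumption that a lower rank total indicates a better-matching concept are all ours.

```python
import numpy as np

def top_k_activating_images(activations, k=10):
    """Return indices of the K probe images with the largest
    (already-summarized) activation for a single neuron.

    `activations` stands in for g(A_k(x_i)) over the probe set."""
    return np.argsort(activations)[::-1][:k]

def concept_score(ranks, beta=5):
    """Aggregate a candidate concept's per-image ranks.

    Per the quoted setup, ranks are squared over the top beta = 5
    ranking images for better differentiation between scores.
    We assume rank 1 is best, so a smaller total is a better match."""
    top = sorted(ranks)[:beta]
    return sum(r ** 2 for r in top)

# Toy example: one neuron's summarized activations over 8 probe images.
acts = np.array([0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.05, 0.6])
top = top_k_activating_images(acts, k=3)
print(top)  # images 1, 3, and 5 activate most strongly

# Ranks of one candidate concept across several scoring images.
print(concept_score([1, 2, 1, 3, 2, 5, 7], beta=5))  # 1+1+4+4+9 = 19
```

With N = 5 candidate concepts per neuron, the concept with the smallest aggregate score would be kept as the description under this sketch's convention.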