Interpreting Neurons in Deep Vision Networks with Language Models

Authors: Nicholas Bai, Rahul Ajay Iyer, Tuomas Oikarinen, Akshay R. Kulkarni, Tsui-Wei Weng

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We have conducted extensive qualitative and quantitative analysis to show that DnD outperforms prior work by providing higher quality neuron descriptions.
Researcher Affiliation | Academia | Nicholas Bai (EMAIL, UC San Diego); Rahul A. Iyer (EMAIL, UT Austin); Tuomas Oikarinen (EMAIL, UC San Diego); Akshay Kulkarni (EMAIL, UC San Diego); Tsui-Wei Weng (EMAIL, UC San Diego)
Pseudocode | Yes | An overview of Describe-and-Dissect (DnD) and these 3 steps are illustrated in Figure 2. ... The algorithm consists of 4 substeps.
Open Source Code | Yes | Our code and data are available at https://github.com/Trustworthy-ML-Lab/Describe-and-Dissect.
Open Datasets | Yes | ResNet-50 and ResNet-18 (He et al., 2016) trained on ImageNet (Russakovsky et al., 2015) and Places365 (Zhou et al., 2016) respectively. ... We dissected both a ResNet-50 network pretrained on ImageNet-1K and ResNet-18 trained on Places365, using the union of ImageNet validation dataset and Broden (Bau et al., 2017) as our probing dataset. ... Tile2Vec (Jean et al., 2019) utilizes a modified ResNet-18 backbone trained to minimize triplet loss between anchor, neighbor, and distant land tiles from the NAIP dataset (Claire Boryan & Craig, 2011). ... We also evaluate a ResNet-50 model trained on labeled EuroSAT images (Helber et al., 2019) with 10 land cover classes.
Dataset Splits | Yes | We use the union of the ImageNet validation dataset and Broden as Dprobe and compare to Network Dissection (Bau et al., 2017), MILAN (Hernandez et al., 2022), and CLIP-dissect (Oikarinen & Weng, 2023) as baselines. ... To compare the performance, following Oikarinen & Weng (2023), we use our model to describe the final layer neurons of ResNet-50 (where we know their ground truth role) and compare description similarity to the class name that neuron is detecting, as discussed in Section 4.2.
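The evaluation above compares a generated neuron description against the neuron's ground-truth class name via embedding similarity. A minimal sketch of that comparison, using hand-coded toy vectors in place of a real text encoder (the actual system would embed descriptions with a learned model; the 4-d vectors below are purely illustrative):

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings standing in for a real text encoder.
emb = {
    "tabby cat":        [0.9, 0.1, 0.0, 0.1],  # ground-truth class name
    "striped housecat": [0.8, 0.2, 0.1, 0.1],  # a good description
    "fire truck":       [0.0, 0.9, 0.1, 0.0],  # an unrelated description
}

# A description that matches the neuron's class should score higher.
good = cosine_similarity(emb["tabby cat"], emb["striped housecat"])
bad = cosine_similarity(emb["tabby cat"], emb["fire truck"])
assert good > bad
```

The design point is simply that description quality is reduced to a scalar similarity in a shared embedding space, so different methods' descriptions can be ranked on the same scale.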
Hardware Specification | Yes | One limitation of Describe-and-Dissect is the relatively high computational cost, taking on average about 38.8 seconds per neuron with a Tesla V100 GPU.
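At 38.8 seconds per neuron, the cost of a full dissection run adds up quickly. A back-of-the-envelope estimate, assuming (our reading of the setup below) 4 layers x 200 neurons for ResNet-50 and 4 x 50 for ResNet-18:

```python
SECONDS_PER_NEURON = 38.8  # reported average on a Tesla V100

# Assumed neuron counts from the paper's main experiment setup.
resnet50_neurons = 4 * 200  # 4 layers, 200 neurons each
resnet18_neurons = 4 * 50   # 4 layers, 50 neurons each

total_seconds = SECONDS_PER_NEURON * (resnet50_neurons + resnet18_neurons)
print(f"{total_seconds / 3600:.1f} GPU-hours")  # roughly 10.8 GPU-hours
```

So reproducing just the reported experiments would take on the order of half a GPU-day, before any ablations.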
Software Dependencies | No | The first model is Bootstrapping Language-Image Pretraining (BLIP) (Li et al., 2022), which is an image-to-text model... The second model is GPT-3.5 Turbo, which is a transformer model developed by OpenAI... The third model is Stable Diffusion (Rombach et al., 2022)...
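The three models are chained: BLIP captions the highly activating images, GPT-3.5 Turbo distills the captions into candidate concepts, and Stable Diffusion generates synthetic images used to score the candidates. A minimal sketch of that pipeline shape, with trivial stand-in functions (none of these are real API calls; the summarizer here just collects the last word of each caption for illustration):

```python
# Hypothetical stand-ins for the three generative models in the pipeline.
def blip_caption(image):
    """Step 1 stand-in for BLIP: caption one activating image."""
    return f"a photo of a {image}"

def gpt_summarize(captions, n=5):
    """Step 2 stand-in for GPT-3.5 Turbo: distill captions into
    up to n candidate concepts (here: unique last words, sorted)."""
    words = [c.split()[-1] for c in captions]
    return sorted(set(words))[:n]

def stable_diffusion_generate(concept):
    """Step 3 stand-in for Stable Diffusion: synthesize an image."""
    return f"<synthetic image of {concept}>"

def describe_and_dissect(activating_images):
    captions = [blip_caption(img) for img in activating_images]        # step 1
    candidates = gpt_summarize(captions)                               # step 2
    synthetic = {c: stable_diffusion_generate(c) for c in candidates}  # step 3
    # The real method then scores candidates by how strongly the neuron
    # fires on the synthetic images; scoring is omitted in this sketch.
    return candidates, synthetic
```

Each stage is a swappable black box, which is why pinning exact model versions matters for reproducibility: a different checkpoint at any stage changes the downstream descriptions.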
Experiment Setup | Yes | The top K most highly activating images for a neuron n are collected in set I, |I| = K, by selecting K images x_i ∈ D_probe ∪ D_cropped with the largest g(A_k(x_i)). ... For the purposes of our experiments, we generate N = 5 candidate concepts unless otherwise mentioned. ... For the purposes of the experiments in this paper, we set Q = 10. ... For our experiments, we use t = 10. In practice, R_j is computed as the square of the ranks in the top β = 5 ranking images for better differentiation between scores, R_j = {(R_j^i)^2 ; i ≤ β}. ... For both models we evaluated 4 of the intermediate layers (end of each residual block), with 200 randomly chosen neurons per layer for ResNet-50 and 50 per layer for ResNet-18.
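Two of the mechanics above can be sketched directly: selecting the K probe images with the largest summarized activation g(A_k(x_i)), and the squared-rank trick R_j = {(R_j^i)^2 ; i ≤ β}. This is a toy illustration under our reading of the setup, not the authors' implementation:

```python
def top_k_images(activations, k):
    """Indices of the k images with the largest summary activation.

    `activations` holds one precomputed g(A_k(x_i)) value per probe image.
    """
    order = sorted(range(len(activations)),
                   key=lambda i: activations[i], reverse=True)
    return order[:k]

def squared_ranks(ranks, beta=5):
    """Square the ranks of the top-beta ranking images, which spreads
    out nearby scores and better differentiates candidate concepts."""
    top = sorted(ranks)[:beta]  # beta best (smallest) ranks
    return [r ** 2 for r in top]

# Toy activations for 4 probe images; the two largest are at indices 1 and 3.
acts = [0.1, 0.9, 0.5, 0.7]
print(top_k_images(acts, k=2))          # [1, 3]
print(squared_ranks([3, 1, 2, 7, 6, 5, 4]))  # [1, 4, 9, 16, 25]
```

Squaring grows faster for worse (larger) ranks, so a candidate concept whose synthetic images rank consistently well is separated more sharply from one with a few poor ranks.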