Gnothi Seauton: Empowering Faithful Self-Interpretability in Black-Box Transformers

Authors: Shaobo Wang, Hongxuan Tang, Mingyang Wang, Hongrui Zhang, Xuyang Liu, Weiya Li, Xuming Hu, Linfeng Zhang

ICLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that AutoGnothi offers accurate explanations for both vision and language tasks, delivering superior computational efficiency with comparable interpretability. (Section 5, Experiments)
Researcher Affiliation | Collaboration | 1) School of Artificial Intelligence, Shanghai Jiao Tong University; 2) Efficient and Precision Intelligent Computing Lab, Shanghai Jiao Tong University; 3) Sichuan University; 4) Big Data and AI Lab, ICBC; 5) Hong Kong University of Science and Technology, Guangzhou
Pseudocode | No | The paper describes methods and equations but does not contain a clearly labeled "Pseudocode" or "Algorithm" block with structured steps.
Open Source Code | No | The paper does not provide any statement about releasing source code or a link to a code repository.
Open Datasets | Yes | For image classification, we used the ImageNette (Howard & Gugger, 2020), Oxford-IIIT Pets (Parkhi et al., 2012), and MURA (Rajpurkar et al., 2017) datasets, following (Covert et al., 2022). For sentiment analysis, we utilized the Yelp Review Polarity dataset (Zhang et al., 2015).
Dataset Splits | Yes | The ImageNette dataset includes 9,469 training samples and 3,925 validation samples across 10 classes. MURA (musculoskeletal radiographs) has 36,808 training samples and 3,197 validation samples for 2 classes. The Oxford-IIIT Pets dataset contains 5,879 training samples, 735 validation samples, and 735 test samples across 37 classes. For text classifiers, each epoch is trained on exactly 2,048 training samples randomly chosen from the dataset and validated on 256 equally random test samples. We selected 1,000 test samples for the ImageNette, MURA, and Yelp Review Polarity datasets, and randomly selected 300 samples for the Oxford-IIIT Pets dataset.
Hardware Specification | Yes | Our experiments were conducted on a 128-core AMD EPYC 9754 CPU and a single NVIDIA GeForce RTX 4090 GPU with 24 GB of VRAM. No multi-card training or inference was involved.
Software Dependencies | No | We implemented the training and inference pipelines for image classification tasks under the PyTorch Lightning framework, and for the sentiment analysis task with plain PyTorch. For evaluations of baseline methods we leveraged the SHAP library (Lundberg, 2017), with minor modifications applied to bridge data format differences between NumPy and PyTorch. Specific version numbers for these software components are not provided.
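The bridging the response mentions is typically a thin adapter at the model boundary, since SHAP explainers pass NumPy arrays while a PyTorch classifier expects tensors. A minimal sketch of such an adapter (the wrapper class and placeholder model are illustrative assumptions, not the authors' code):

```python
import numpy as np
import torch

class TorchModelWrapper:
    """Hypothetical adapter: accepts NumPy arrays from SHAP, returns NumPy arrays."""

    def __init__(self, model: torch.nn.Module):
        self.model = model.eval()

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # Convert NumPy -> torch at the input boundary, torch -> NumPy at the output.
        with torch.no_grad():
            out = self.model(torch.from_numpy(x).float())
        return out.numpy()

# Placeholder classifier standing in for the fine-tuned model.
model = torch.nn.Linear(4, 2)
wrapped = TorchModelWrapper(model)
probs = wrapped(np.random.rand(3, 4).astype(np.float32))
assert probs.shape == (3, 2)
```

The wrapped callable can then be handed to a SHAP explainer in place of the raw PyTorch module.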
Experiment Setup | Yes | We fine-tune this classifier model on the exact same dataset with the AdamW optimizer, using a learning rate of 1e-5 for 25 epochs, and retain the best checkpoint as rated by minimal validation loss. We train the explainer model for 100 epochs with the AdamW optimizer, using a learning rate of 1e-5, and keep the best checkpoint. In our implementation, we used 2 input images in each mini-batch and generated 16 random masks for each image, resulting in a parallelism of 32 instances per batch. AutoGnothi incorporates the same number of MSA blocks as the black-box model being explained in its side network, and uses a reduction factor of r = 8 for the lightweight side branch on the ImageNette and Oxford-IIIT Pets datasets, and r = 4 for MURA and Yelp Review Polarity.
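The reported batching scheme (2 images per mini-batch, 16 random masks each, 32 instances in parallel) and optimizer choice can be sketched as follows. This is an assumed reconstruction, not the authors' implementation; the 14x14 ViT patch grid and the placeholder explainer network are illustrative guesses:

```python
import torch

NUM_PATCHES = 196       # assumption: ViT-style 14x14 patch grid
IMAGES_PER_BATCH = 2    # reported: 2 input images per mini-batch
MASKS_PER_IMAGE = 16    # reported: 16 random masks per image

def make_masked_batch(images: torch.Tensor):
    """Pair each image with MASKS_PER_IMAGE random binary patch masks,
    yielding IMAGES_PER_BATCH * MASKS_PER_IMAGE parallel instances."""
    masks = (torch.rand(images.shape[0], MASKS_PER_IMAGE, NUM_PATCHES) > 0.5).float()
    repeated = images.repeat_interleave(MASKS_PER_IMAGE, dim=0)  # (32, C, H, W)
    return repeated, masks.reshape(-1, NUM_PATCHES)              # (32, NUM_PATCHES)

images = torch.randn(IMAGES_PER_BATCH, 3, 224, 224)
batch, masks = make_masked_batch(images)
assert batch.shape[0] == masks.shape[0] == 32  # parallelism of 32 instances

# Optimizer per the reported setup: AdamW at a learning rate of 1e-5.
explainer = torch.nn.Linear(NUM_PATCHES, NUM_PATCHES)  # placeholder for the side network
optimizer = torch.optim.AdamW(explainer.parameters(), lr=1e-5)
```

The repeat-interleave layout keeps all masked copies of one image adjacent in the batch, which makes it straightforward to average mask-conditioned predictions per image later.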