reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Beyond Intuition: Rethinking Token Attributions inside Transformers

Authors: Jiamin Chen, Xuhong Li, Lei Yu, Dejing Dou, Haoyi Xiong

TMLR 2023

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our method is further validated qualitatively and quantitatively through the faithfulness evaluations across different settings: single modality (BERT and ViT) and bi-modality (CLIP), different model sizes (ViT-L) and different pooling strategies (ViT-MAE) to demonstrate the broad applicability and clear improvements over existing methods. 4 Experiments We validate our proposed explanation method by comparing the results with several strong baselines. The experiment settings are based on two aspects: different modalities and different model versions. The experimental results show the clear advantages and wide applicability of our methods over the others in explaining Transformers. 4.1 Experimental Settings Faithfulness Evaluation. Following previous works (Abnar & Zuidema, 2020; Chefer et al., 2021a;b; Samek et al., 2017; Vu et al., 2019; DeYoung et al., 2020), we prepare three types of tests for the trustworthiness evaluation: 1) Perturbation Tests... 2) Segmentation Tests... 3) Language Reasoning... 4.4 Ablation Study We propose two ablation studies...
Researcher Affiliation	Collaboration	Jiamin Chen EMAIL Beihang University & Baidu Inc.; Xuhong Li EMAIL Baidu Inc.; Lei Yu EMAIL Beihang University & Beihang Hangzhou Innovation Institute Yuhang; Dejing Dou EMAIL Baidu Inc.; Haoyi Xiong EMAIL Baidu Inc.
Pseudocode	No	The paper describes its methodology using mathematical derivations and textual descriptions in Section 3, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Code available at https://github.com/jiaminchen-1031/transformerinterp and InterpretDL (Li et al., 2022) as well.
Open Datasets	Yes	Language Reasoning comes from a NLP benchmark ERASER (DeYoung et al., 2020) for rationales extraction... We select randomly 5 images per class (5k in total) from the ImageNet validation set for the perturbation tests, and the dataset of ImageNet-Segmentation (Guillaumin et al., 2014) for the segmentation tests... Movie Reviews Dataset (DeYoung et al., 2020)... 20 Newsgroups Dataset (Lang, 1995).
Dataset Splits	Yes	We select randomly 5 images per class (5k in total) from the ImageNet validation set for the perturbation tests, and the dataset of ImageNet-Segmentation (Guillaumin et al., 2014) for the segmentation tests. ... We finetune a BERT-base model on its training data, with the accuracy reaching 93% on testing set. We randomly select 3000 documents from the testing set for the perturbation.
Hardware Specification	No	The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments.
Software Dependencies	No	The paper mentions 'InterpretDL (Li et al., 2022)' as a tool used, but it does not specify version numbers for any software components (e.g., programming languages, libraries, frameworks) crucial for reproducibility.
Experiment Setup	No	The paper describes various experimental settings, such as different modalities (BERT, ViT, CLIP) and evaluation tests (Perturbation, Segmentation, Language Reasoning). However, it does not explicitly provide concrete hyperparameter values or detailed training configurations (e.g., learning rate, batch size, number of epochs, optimizer settings) for the models or baselines used in the experiments.
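
The Dataset Splits row quotes a random selection of 5 images per class (5k in total) from the ImageNet validation set. The excerpts above do not say how that selection was drawn or whether a seed was fixed, so the following is only a minimal sketch of one way to make such a per-class draw repeatable. The function name `sample_per_class`, the `seed` parameter, and the toy label list are assumptions for illustration and are not taken from the paper or its repository.

```python
import random
from collections import defaultdict


def sample_per_class(labels, per_class=5, seed=0):
    """Pick `per_class` random indices for every distinct label.

    Hypothetical helper: the paper does not describe its sampling code.
    A fixed seed makes the selected subset repeatable across runs.
    """
    rng = random.Random(seed)

    # Group example indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    chosen = []
    # Iterate classes in sorted order so the draw does not depend on input order of classes.
    for label in sorted(by_class):
        members = by_class[label]
        if len(members) < per_class:
            raise ValueError(f"class {label!r} has only {len(members)} examples")
        chosen.extend(rng.sample(members, per_class))
    return sorted(chosen)


# Toy stand-in for validation labels: 1000 classes with 50 examples each
# (the shape of the ImageNet validation set, but with synthetic labels).
toy_labels = [c for c in range(1000) for _ in range(50)]
subset = sample_per_class(toy_labels, per_class=5, seed=0)
print(len(subset))  # 5000
```

Recording the seed, or the resulting index list, alongside the results would let a reader rerun the perturbation tests on the same 5k images.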