F-Fidelity: A Robust Framework for Faithfulness Evaluation of Explainable AI
Authors: Xu Zheng, Farhad Shirani, Zhuomin Chen, Chaohao Lin, Wei Cheng, Wenbo Guo, Dongsheng Luo
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on multiple data modalities, such as images, time series, and natural language. The results demonstrate that F-Fidelity significantly improves upon prior evaluation metrics in recovering the ground-truth ranking of the explainers. Furthermore, we provide comprehensive empirical faithfulness evaluations on a collection of explainers that are systematically degraded from an original explainer through controlled random perturbations. Thus the correct (ground-truth) ranking of the explainers, in terms of faithfulness, is known beforehand. |
| Researcher Affiliation | Collaboration | Florida International University, Miami, United States; NEC Laboratories America, Princeton, United States; University of California, Santa Barbara, United States |
| Pseudocode | Yes | Algorithm 1: Computing FFid+, FFid− |
| Open Source Code | Yes | The source code is available at https://trustai4s-lab.github.io/ffidelity. |
| Open Datasets | Yes | We use CIFAR-100 (Krizhevsky et al., 2009) and Tiny-ImageNet (Deng et al., 2009) as the benchmark datasets. We use two benchmark datasets for time series analysis: PAM for human activity recognition and Boiler for mechanical fault detection (Queen, 2023). We use two benchmark datasets for our NLP experiments: the Stanford Sentiment Treebank (SST2) (Socher et al., 2013) for binary sentiment classification and the Boolean Questions (BoolQ) (Socher et al., 2013) dataset for question-answering tasks. We select a subset of 400 samples with an explanation size ratio of 0.2 from the colored-MNIST dataset (Arjovsky et al., 2019). |
| Dataset Splits | Yes | We use CIFAR-100 (Krizhevsky et al., 2009)... It contains 50,000 training images and 10,000 test images. Tiny-ImageNet (Deng et al., 2009)... contains 200 classes, with 500 training images, 50 validation images, and 50 test images per class. For SST2, we utilize 67,349 sentences for training and 872 for testing. BoolQ comprises 9,427 question-answer pairs for training and 3,270 for testing. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed machine specifications) used for running its experiments. It only refers to general model architectures like ResNet and ViT. |
| Software Dependencies | No | The paper mentions optimizers such as Adam and AdamW and the Captum library, but does not provide version numbers for these software components or for the underlying languages and frameworks (e.g., 'Python 3.x', 'PyTorch 1.x'). |
| Experiment Setup | Yes | In the training stage, we set the learning rate and weight decay to 1E-4 for ResNet, and set the learning rate to 1E-3 and weight decay to 1E-4 for ViT. We use Adam as the optimizer, and the training epochs are 100 for ResNet and 200 for ViT. During fine-tuning, we use the same hyperparameters. The simple LSTM model contains 3 bidirectional LSTM layers with a hidden embedding size of 128. We use AdamW as the optimizer with a learning rate of 1E-3 and a weight decay of 1E-2. The LSTM has one hidden layer with the dimension set to 128. In the Transformer, the hidden dimension is set to 512. The number of Transformer layers for SST2 and BoolQ is 2 and 4, respectively, with 4 and 8 attention heads. For all datasets and architectures, we use the Adam optimizer (Kingma & Ba, 2015) with a default learning rate of 1E-4 and 100 training epochs. |
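The hyperparameters quoted in the Experiment Setup row can be collected into a small per-backbone lookup, which makes the reported settings easy to re-check. This is a minimal plain-Python sketch; the function and key names (`make_training_config`, `"resnet"`, etc.) are illustrative and not from the paper.

```python
def make_training_config(model: str) -> dict:
    """Return the training hyperparameters reported in the paper
    for a given backbone (names here are illustrative)."""
    configs = {
        # Image classifiers (CIFAR-100 / Tiny-ImageNet)
        "resnet": {"optimizer": "Adam", "lr": 1e-4,
                   "weight_decay": 1e-4, "epochs": 100},
        "vit":    {"optimizer": "Adam", "lr": 1e-3,
                   "weight_decay": 1e-4, "epochs": 200},
        # Time-series model: 3 bidirectional LSTM layers, hidden size 128
        "lstm_timeseries": {"optimizer": "AdamW", "lr": 1e-3,
                            "weight_decay": 1e-2},
        # NLP Transformers (SST2: 2 layers / 4 heads; BoolQ: 4 layers / 8 heads)
        "transformer_sst2":  {"optimizer": "Adam", "lr": 1e-4,
                              "epochs": 100, "layers": 2, "heads": 4,
                              "hidden_dim": 512},
        "transformer_boolq": {"optimizer": "Adam", "lr": 1e-4,
                              "epochs": 100, "layers": 4, "heads": 8,
                              "hidden_dim": 512},
    }
    return configs[model]
```

The same hyperparameters are reported for fine-tuning as for training, so a single lookup covers both stages.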