Explaining Black-box Model Predictions via Two-level Nested Feature Attributions with Consistency Property

Authors: Yuya Yoshikawa, Masanari Kimura, Ryotaro Shimizu, Yuki Saito

IJCAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We conducted experiments on two tasks, in the image and text domains, to evaluate the effectiveness of the proposed method, referred to as C2FA, implemented with Algorithm 1 in Appendix A. Its hyperparameters are provided in Appendix B. Comparing Methods. We used five methods for comparison: LIME [Ribeiro et al., 2016], MILLI [Early et al., 2022], Bottom-Up LIME (BU-LIME), Top-Down LIME (TD-LIME), and Top-Down MILLI (TD-MILLI).
Researcher Affiliation Collaboration Yuya Yoshikawa (1), Masanari Kimura (2), Ryotaro Shimizu (3) and Yuki Saito (3); (1) STAIR Lab, Chiba Institute of Technology; (2) School of Mathematics and Statistics, The University of Melbourne; (3) ZOZO Research. EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode Yes Algorithm 1 Estimating consistent two-level feature attributions (C2FA) with ℓ2 regularization
Open Source Code No The text mentions "In the official SHAP library [shap (GitHub), 2024]", which refers to a third-party tool, not the authors' own code for the proposed methodology. There is no explicit statement or link provided for the code of the method described in this paper.
Open Datasets Yes We constructed an MIL dataset from the Pascal VOC semantic segmentation dataset [Everingham et al., 2015] with the ground-truth instance- and pixel-level labels. We constructed a dataset in which the validation and test sets contain 500 and 1,000 product reviews, respectively, randomly sampled from the Amazon reviews dataset [Zhang et al., 2015].
Dataset Splits Yes The number of samples in the training, validation, and test subsets is 5,000, 1,000, and 2,000, respectively, with an equal ratio of positive to negative samples. We constructed a dataset in which the validation and test sets contain 500 and 1,000 product reviews, respectively, randomly sampled from the Amazon reviews dataset [Zhang et al., 2015].
Hardware Specification Yes The experiments were conducted on a server with an Intel Xeon Gold 6148 CPU and an NVIDIA Tesla V100 GPU.
Software Dependencies No The paper mentions software such as the Adam optimizer, BERT, ResNet-50, and Hugging Face, but does not provide specific version numbers for these components of the experimental setup. For example, it mentions "Adam optimizer [Kingma and Ba, 2015]" without specifying the version used.
Experiment Setup Yes We trained the model using our MIL image classification dataset with the Adam optimizer [Kingma and Ba, 2015], a learning rate of 0.001, a batch size of 32, and a maximum of 300 epochs. The hyperparameters of C2FA, λH, λL, and µ2, were tuned using the validation subset of each dataset within the following ranges: λH, λL ∈ {0.1, 1}, and µ2 ∈ {0.001, 0.01, 0.1}. The remaining hyperparameters were set to µ1 = 0.1 and ϵ1 = ϵ2 = 10^-4.
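The quoted setup implies a small validation grid over λH, λL, and µ2 with µ1, ϵ1, ϵ2 held fixed. A minimal sketch that enumerates it (variable names are hypothetical; the grid values are taken from the quoted setup, everything else is illustrative):

```python
# Sketch: enumerating the C2FA hyperparameter grid from the quoted setup.
from itertools import product

lambda_H_range = [0.1, 1]              # λH ∈ {0.1, 1}
lambda_L_range = [0.1, 1]              # λL ∈ {0.1, 1}
mu2_range = [0.001, 0.01, 0.1]         # µ2 ∈ {0.001, 0.01, 0.1}

# Fixed (untuned) hyperparameters from the quoted setup.
fixed = {"mu1": 0.1, "eps1": 1e-4, "eps2": 1e-4}

# All candidate configurations to evaluate on the validation subset.
grid = [
    {"lambda_H": lh, "lambda_L": ll, "mu2": m2, **fixed}
    for lh, ll, m2 in product(lambda_H_range, lambda_L_range, mu2_range)
]

print(len(grid))  # prints 12 (2 * 2 * 3 candidate configurations)
```

Under these ranges, tuning requires 12 validation runs per dataset; the best configuration would then be applied to the test subset.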