Explaining Black-box Model Predictions via Two-level Nested Feature Attributions with Consistency Property

Authors: Yuya Yoshikawa, Masanari Kimura, Ryotaro Shimizu, Yuki Saito

IJCAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We conducted experiments on two tasks, in the image and text domains, to evaluate the effectiveness of the proposed method, referred to as C2FA, implemented with Algorithm 1 in Appendix A. Its hyperparameters are provided in Appendix B. Comparing Methods. We used five methods for comparison: LIME [Ribeiro et al., 2016], MILLI [Early et al., 2022], Bottom-Up LIME (BU-LIME), Top-Down LIME (TD-LIME), and Top-Down MILLI (TD-MILLI).
Researcher Affiliation Collaboration Yuya Yoshikawa (1), Masanari Kimura (2), Ryotaro Shimizu (3) and Yuki Saito (3); (1) STAIR Lab, Chiba Institute of Technology; (2) School of Mathematics and Statistics, The University of Melbourne; (3) ZOZO Research. EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode Yes Algorithm 1 Estimating consistent two-level feature attributions (C2FA) with ℓ2 regularization
Open Source Code No The text mentions "In the official SHAP library [shap (GitHub), 2024]", which refers to a third-party tool, not the authors' own code for the proposed methodology. There is no explicit statement or link provided for the code of the method described in this paper.
Open Datasets Yes We constructed an MIL dataset from the Pascal VOC semantic segmentation dataset [Everingham et al., 2015] with the ground-truth instance- and pixel-level labels. We constructed a dataset in which the validation and test sets contain 500 and 1,000 product reviews, respectively, randomly sampled from the Amazon reviews dataset [Zhang et al., 2015].
Dataset Splits Yes The number of samples in the training, validation, and test subsets is 5,000, 1,000, and 2,000, respectively, with an equal ratio of positive to negative samples. We constructed a dataset in which the validation and test sets contain 500 and 1,000 product reviews, respectively, randomly sampled from the Amazon reviews dataset [Zhang et al., 2015].
Hardware Specification Yes The experiments were conducted on a server with an Intel Xeon Gold 6148 CPU and an NVIDIA Tesla V100 GPU.
Software Dependencies No The paper mentions software such as the Adam optimizer, BERT, ResNet-50, and Hugging Face, but does not provide specific version numbers for these components of the experimental setup. For example, it mentions "Adam optimizer [Kingma and Ba, 2015]" without specifying the version used.
Experiment Setup Yes We trained the model using our MIL image classification dataset with the Adam optimizer [Kingma and Ba, 2015], a learning rate of 0.001, a batch size of 32, and a maximum of 300 epochs. The hyperparameters of C2FA, λH, λL, and µ2, were tuned using the validation subset of each dataset within the following ranges: λH, λL ∈ {0.1, 1}, and µ2 ∈ {0.001, 0.01, 0.1}. The remaining hyperparameters were set to µ1 = 0.1 and ϵ1 = ϵ2 = 10^-4.
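The quoted setup implies a small validation grid over λH, λL, and µ2 with µ1, ϵ1, ϵ2 held fixed. A minimal sketch that enumerates it (variable names are hypothetical; the grid values are taken from the quoted setup, everything else is illustrative):

```python
# Sketch: enumerating the C2FA hyperparameter grid from the quoted setup.
from itertools import product

lambda_H_range = [0.1, 1]              # λH ∈ {0.1, 1}
lambda_L_range = [0.1, 1]              # λL ∈ {0.1, 1}
mu2_range = [0.001, 0.01, 0.1]         # µ2 ∈ {0.001, 0.01, 0.1}

# Fixed (untuned) hyperparameters from the quoted setup.
fixed = {"mu1": 0.1, "eps1": 1e-4, "eps2": 1e-4}

# All candidate configurations to evaluate on the validation subset.
grid = [
    {"lambda_H": lh, "lambda_L": ll, "mu2": m2, **fixed}
    for lh, ll, m2 in product(lambda_H_range, lambda_L_range, mu2_range)
]

print(len(grid))  # prints 12 (2 * 2 * 3 candidate configurations)
```

Under these ranges, tuning requires 12 validation runs per dataset; the best configuration would then be applied to the test subset.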