LIMIS: Locally Interpretable Modeling using Instance-wise Subsampling
Authors: Jinsung Yoon, Sercan Ö. Arik, Tomas Pfister
TMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show on multiple tabular datasets that LIMIS near-matches the prediction accuracy of black-box models, significantly outperforming state-of-the-art locally interpretable models in terms of fidelity and prediction accuracy. We next study LIMIS on 3 real-world regression datasets: (1) Blog Feedback, (2) Facebook Comment, (3) News Popularity; and 2 real-world classification datasets: (4) Adult Income, (5) Weather. We evaluate the performance on disjoint testing sets D^t = {(x^t_k, y^t_k)}_{k=1}^{L} and report the results over 10 independent runs. |
| Researcher Affiliation | Industry | Jinsung Yoon, Sercan Ö. Arik, Tomas Pfister, Google Cloud AI |
| Pseudocode | Yes | Pseudo-code of the LIMIS training is in Algorithm 1. Pseudo-code of the LIMIS inference is in Algorithm 2. |
| Open Source Code | No | The paper does not provide concrete access to source code for the LIMIS methodology described. It only provides links to benchmark models (LIME, SILO, MAPLE) in Appendix C: Implementations of benchmark models. There is no explicit statement or link for LIMIS code. |
| Open Datasets | Yes | We next study LIMIS on 3 real-world regression datasets: (1) Blog Feedback, (2) Facebook Comment, (3) News Popularity; and 2 real-world classification datasets: (4) Adult Income, (5) Weather. These are well-known, publicly available benchmark datasets. |
| Dataset Splits | No | We evaluate the performance on disjoint testing sets D^t = {(x^t_k, y^t_k)}_{k=1}^{L} and report the results over 10 independent runs. If there is no explicit probe dataset, it can be randomly split from the training dataset (D). The paper mentions using training, probe, and test sets, and notes that a probe dataset can be randomly split from training data. However, it does not specify concrete percentages, absolute counts, or reference predefined splits for the datasets used. |
| Hardware Specification | Yes | On a single NVIDIA V100 GPU (without any hardware optimizations), LIMIS yields a training time of less than 5 hours (including Stage 1, 2 and 3) and an interpretable inference time of less than 10 seconds per testing instance. Training time is computed on a single K80 GPU until the model convergence (i.e., no more validation fidelity improvements). |
| Software Dependencies | No | The paper lists various predictive models and their hyperparameters (e.g., XGBoost, LightGBM, MLP, Ridge Regression) in Appendix A, along with optimizers and activation functions. However, it does not specify version numbers for any software libraries, frameworks, or languages used (e.g., 'PyTorch 1.9', 'Python 3.8'). |
| Experiment Setup | Yes | Hyper-parameters are optimized to maximize the validation fidelity. Appendix A: Hyper-parameters of the predictive models, details hyperparameters for XGBoost (booster gbtree, max depth 6, learning rate 0.3, number of estimators 1000, reg alpha 0), LightGBM (booster gbdt, learning rate 0.1, number of estimators 1000, min data in leaf 20), Random Forests (number of estimators 1000, criterion gini), Multi-layer Perceptron (number of layers 4, hidden units [feature dimensions, feature dimensions/2, feature dimensions/4, feature dimensions/8], activation function ReLU, early stopping True with patience 10, batch size 256, maximum number of epochs 200, optimizer Adam), and others. |
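The MLP hyperparameters quoted above (4 layers with hidden units halving from the feature dimension, ReLU, Adam, batch size 256, 200 epochs maximum, early stopping with patience 10) can be sketched as follows. This is a minimal illustration, assuming scikit-learn's `MLPRegressor`; the paper does not name the framework it used, and the helper `build_mlp` is hypothetical.

```python
from sklearn.neural_network import MLPRegressor

def build_mlp(feature_dim: int) -> MLPRegressor:
    # Hidden units halve at each of the 4 layers:
    # [d, d/2, d/4, d/8], as listed in Appendix A of the paper.
    sizes = tuple(max(1, feature_dim // (2 ** i)) for i in range(4))
    return MLPRegressor(
        hidden_layer_sizes=sizes,
        activation="relu",
        solver="adam",           # Adam optimizer
        batch_size=256,
        max_iter=200,            # maximum number of epochs
        early_stopping=True,
        n_iter_no_change=10,     # early-stopping patience
    )

# e.g. a 64-dimensional feature space yields hidden layers (64, 32, 16, 8)
model = build_mlp(64)
```

Validation-fidelity-based hyperparameter search, as the paper describes, would wrap such a constructor in a search loop rather than fix these values.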