CohEx: A Generalized Framework for Cohort Explanation

Authors: Fanyu Meng, Xin Liu, Zhaodan Kong, Xin Chen

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To evaluate the proposed method, we experiment on three scenarios: (1) the patient classification problem in Sect. 1, a small-scale example to demonstrate the importance of using supervised clustering and revising importance; (2) bike sharing, a classical ML regression problem to predict the hourly number of rented bikes (Fanaee-T 2013); and (3) MNIST digit classification (Deng 2012), a vision-based task with deep learning to mimic more realistic and complex scenarios. For baselines, we compare with the following methods: VINE, which applies supervised clustering on the local importance scores (Britton 2019); REPID, a tree-based partitioning method (Herbinger, Bischl, and Casalicchio 2022), which for the sake of comparison we modify to use a local explainer consistent with the other methods instead of the original partial-dependence explainer; and GALE, a homogeneity-based re-weighting mechanism that improves the quality of importance aggregation in classification problems (Van Der Linden, Haned, and Kanoulas 2019). GALE is not a cohort explainer by definition, but we feed the GALE importances into VINE and REPID to assess the quality of the reweighted importance; note that GALE is not applicable to regression problems. We also compare against a hierarchical cohort explanation: first compute the local importances, then run supervised clustering with SRIDHCR once. The difference between this method and the proposed CohEx is that it does not iteratively recompute the local importance scores. (Sections covered: 8.1 Synthetic Patient Classification; 8.2 Bike Sharing; 8.3 Digit Classification; 8.4 Evaluation Metrics and Analysis)
Researcher Affiliation | Academia | Fanyu Meng (1), Xin Liu (1), Zhaodan Kong (1), Xin Chen (2); (1) University of California, Davis; (2) Georgia Institute of Technology. EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | Yes | Algorithm 1: Generalized cohort explanation conversion framework (CohEx).
Input: number of iterations n, expected number of cohorts k, target model M, dataset X, data-driven local explanation method ω, supervised clustering algorithm g.
Output: cohort assignments a_x ∈ [1, k] for each x ∈ X, and cohort explanations w_j, j ∈ [1, k].
for i = 1, ..., n do
    randomly select k centroids c_1, ..., c_k from X
    for x ∈ X do
        a^i_x ← argmin_{1 ≤ a ≤ k} ||x − c_a||_2    /* assign each sample to the closest centroid */
    repeat
        for j = 1, ..., k do
            X_j ← {x | a^i_x = j}    /* the samples in this cohort */
            for x ∈ X_j do
                w^i_x ← ω_M(X_j, x)    /* recompute explanations using only samples in the cohort */
            w^i_j ← (Σ_{x ∈ X_j} w^i_x) / |X_j|    /* compute the average local explanation */
        k, {a^i_x} ← g(k, X, {w^i_x})    /* recluster using the new explanations */
    until clustering loss does not decrease for t iterations
return the best {a^i_x} and {w^i_j} with the lowest clustering loss
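The iterative structure of Algorithm 1 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `cohex`, the `local_explainer(X_cohort, x)` stand-in for ω_M, and the `cluster_fn(k, X, W) -> (assignments, loss)` stand-in for the supervised clustering algorithm g (e.g. SRIDHCR) are all assumptions introduced here for readability.

```python
import numpy as np

def cohex(X, local_explainer, cluster_fn, k=4, n_iters=5, patience=3, seed=0):
    """Sketch of the CohEx loop: alternate between recomputing local
    importances within each cohort and supervised re-clustering on those
    importances. `local_explainer` and `cluster_fn` are hypothetical
    stand-ins for the paper's omega_M and g."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    best = (np.inf, None, None)  # (loss, assignments, cohort explanations)
    for _ in range(n_iters):
        # Random restart: assign each sample to the nearest of k random centroids.
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        assign = np.argmin(np.linalg.norm(X[:, None] - centroids[None], axis=2), axis=1)
        prev_loss, stale = np.inf, 0
        while stale < patience:
            # Recompute local importances using only the samples in each cohort.
            W = np.zeros((len(X), d))
            for j in range(k):
                idx = np.flatnonzero(assign == j)
                for i in idx:
                    W[i] = local_explainer(X[idx], X[i])
            # Supervised re-clustering on the new importance scores.
            assign, loss = cluster_fn(k, X, W)
            stale = 0 if loss < prev_loss else stale + 1
            prev_loss = min(prev_loss, loss)
            if loss < best[0]:
                # Average local explanations per cohort (zeros for empty cohorts).
                expl = np.stack([W[assign == j].mean(axis=0) if (assign == j).any()
                                 else np.zeros(d) for j in range(k)])
                best = (loss, assign.copy(), expl)
    return best[1], best[2]  # cohort assignments and per-cohort explanations
```

In practice `local_explainer` would wrap LIME or SHAP restricted to the cohort's samples, and `cluster_fn` would be the supervised clustering step described in the paper.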
Open Source Code | Yes | Code: https://github.com/fy-meng/cohex
Open Datasets | Yes | "bike sharing, a classical ML regression problem to predict the hourly number of rented bikes (Fanaee-T 2013)" and "MNIST digit classification (Deng 2012)"
Dataset Splits | No | The paper mentions using a "test dataset" for MNIST and describes experiments on specific datasets, but does not explicitly provide percentages, sample counts, or detailed methodologies for training/validation/test splits for any of the datasets used.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments.
Software Dependencies | No | The paper mentions software such as Python's SHAP package, an XGBoost model, LIME, and DeepLIFT, but does not specify their version numbers.
Experiment Setup | Yes | For this evaluation, the expected number of cohorts k is set to 4, and the maximum depth of the tree in REPID is set to 2 so that the number of cohorts is consistent across all methods. We use LIME as the base method... We continue to use k = 4 or a depth of two for a similar number of cohorts, and use SHAP as the local explainer. We built the target model as a neural network with two convolutional layers and two fully-connected layers and achieved an accuracy of 93.93% on the test dataset.
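The paper specifies only the layer counts for the MNIST target model (two convolutional and two fully-connected layers), not the exact architecture. A minimal PyTorch sketch consistent with that description might look like the following; the class name `SmallCNN` and the channel/width choices (16, 32, 128) are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Two conv + two fully-connected layers for 28x28 MNIST digits.
    Layer widths are assumed for illustration, not taken from the paper."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # 28x28 -> 28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> 14x14
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # -> 14x14
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> 7x7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 7 * 7, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))
```

Such a model would then be explained per-cohort with SHAP, as described in the setup above.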