[Re] Classwise-Shapley values for data valuation
Authors: Markus Semmler, Miguel de Benito Delgado
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate CS-Shapley, a data valuation method introduced in Schoch et al. (2022) for classification problems. We repeat the experiments in the paper, including two additional methods, the Least Core (Yan & Procaccia, 2021) and Data Banzhaf (Wang & Jia, 2023), a comparison not found in the literature. We include more conservative error estimates and additional metrics, like rank stability, and a variance-corrected version of Weighted Accuracy Drop, originally introduced in Schoch et al. (2022). We conclude that while CS-Shapley helps in the scenarios it was originally tested in, in particular for the detection of corrupted labels, it is outperformed by the conceptually simpler Data Banzhaf in the task of detecting highly influential points, except for highly imbalanced multi-class problems. |
| Researcher Affiliation | Industry | Markus Semmler EMAIL appliedAI Initiative GmbH Miguel de Benito Delgado EMAIL appliedAI Institute gGmbH |
| Pseudocode | No | The paper describes valuation methods and their equations (e.g., Equation 1, 2, 3, 4, 5, 6) and general procedures in narrative text, but does not include any distinct, structured pseudocode blocks or sections labeled as 'Algorithm' or 'Pseudocode'. |
| Open Source Code | Yes | Code for all our experiments is available in (Semmler, 2024), including both setups and instructions on running them. URL https://github.com/aai-institute/re-classwise-shapley. |
| Open Datasets | Yes | Datasets are from openml (Vanschoren et al., 2013). All but Covertype and MNIST-multi are for binary classification. ... Table 1: Datasets used. Tabular: Diabetes, Click, Covertype, CPU, Phoneme. Image: FMNIST, CIFAR10, MNIST-binary, MNIST-multi. |
| Dataset Splits | Yes | Stratified sampling was used for the splits to maintain label distribution. ... Table 1: Datasets used (Training / Valuation / Test): Diabetes 128 / 128 / 512; Click, Covertype, CPU, Phoneme, FMNIST, CIFAR10, MNIST-binary, and MNIST-multi each 500 / 500 / 2000. |
| Hardware Specification | No | We ran all experiments with the method implementations available in the open source library pyDVL v0.9.1 (Transfer Lab, 2022), on several high-cpu VMs of a cloud vendor. The text mentions 'high-cpu VMs of a cloud vendor' but does not specify exact CPU models, GPU models, or cloud instance types. |
| Software Dependencies | Yes | We ran all experiments with the method implementations available in the open source library pyDVL v0.9.1 (Transfer Lab, 2022)... Models used to compute values and changes made to the default parameters in scikit-learn 1.2.2. |
| Experiment Setup | Yes | Parameters for all methods were taken as suggested in Schoch et al. (2022) or the corresponding papers. ... Table 2: Methods evaluated. Convergence criteria as provided by pyDVL (Transfer Lab, 2022). TMCS: ε = 10⁻⁴; Beta Shapley: α = 16, β = 1; Data Banzhaf: K = 5000 samples; Least Core: K = 5000 constraints. ... Table 3: Models used to compute values and changes made to the default parameters in scikit-learn 1.2.2: Logistic regression (solver=liblinear); Gradient Boosting classifier (n_estimators=40, min_samples_leaf=6, max_depth=2); K-Nearest Neighbours (n_neighbors=5, weights=uniform); SVM (kernel=rbf). |
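Data Banzhaf, one of the two methods the report adds to the comparison, values each training point as the difference between the average utility of random subsets containing the point and those excluding it, with subsets drawn uniformly (each point included with probability 1/2). The sketch below is an illustrative Monte Carlo version of this estimator on synthetic data, not the pyDVL implementation the report actually uses; the utility function, the toy dataset, and the reduced sample count `K = 200` (the report uses K = 5000) are assumptions for the sake of a quick-running example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy stand-in for a real dataset: half for training, half for valuation.
X, y = make_classification(n_samples=200, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)
n = len(y_tr)


def utility(idx: np.ndarray) -> float:
    """Validation accuracy of a model trained on the subset `idx`."""
    if len(idx) == 0 or len(set(y_tr[idx])) < 2:
        return 0.0  # degenerate subsets get zero utility
    model = LogisticRegression(solver="liblinear").fit(X_tr[idx], y_tr[idx])
    return model.score(X_val, y_val)


K = 200  # Monte Carlo samples (the report uses K = 5000)
sums_in = np.zeros(n)
counts_in = np.zeros(n)
sums_out = np.zeros(n)
counts_out = np.zeros(n)

for _ in range(K):
    mask = rng.random(n) < 0.5  # each point included w.p. 1/2
    u = utility(np.flatnonzero(mask))
    # Every sampled subset updates the estimate for all n points at once.
    sums_in[mask] += u
    counts_in[mask] += 1
    sums_out[~mask] += u
    counts_out[~mask] += 1

# Banzhaf value: E[U(S) | i in S] - E[U(S) | i not in S].
values = sums_in / np.maximum(counts_in, 1) - sums_out / np.maximum(counts_out, 1)
```

Every sampled subset contributes to the running estimate of all n points simultaneously, which is what makes this estimator sample-efficient compared to permutation-based Shapley sampling.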
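The split sizes in Table 1 (e.g. 500 training, 500 valuation, 2000 test points) combined with stratified sampling can be reproduced with two chained `train_test_split` calls. The sketch below uses a synthetic dataset as a stand-in for the OpenML downloads; the split sizes follow Table 1, but the exact splitting code is an assumption, not taken from the report's repository.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for an OpenML dataset.
X, y = make_classification(
    n_samples=3000, n_classes=2, weights=[0.7, 0.3], random_state=0
)

# First carve out the 2000-point test set, then split the remaining
# 1000 points evenly into training and valuation sets. `stratify`
# preserves the class proportions in every split.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=2000, stratify=y, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=500, stratify=y_rest, random_state=0
)
```

Stratifying both stages matters for the imbalanced datasets in the benchmark: without it, a 500-point training sample could easily under-represent the minority class.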
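The model configurations quoted from Table 3 translate directly into scikit-learn constructor calls. The sketch below instantiates them with exactly the non-default parameters listed; the dictionary keys are arbitrary labels, and anything not listed in Table 3 is left at its scikit-learn default.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Models from Table 3, with only the listed deviations from
# scikit-learn defaults (reference version: scikit-learn 1.2.2).
models = {
    "logistic_regression": LogisticRegression(solver="liblinear"),
    "gradient_boosting": GradientBoostingClassifier(
        n_estimators=40, min_samples_leaf=6, max_depth=2
    ),
    "knn": KNeighborsClassifier(n_neighbors=5, weights="uniform"),
    "svm": SVC(kernel="rbf"),
}
```

Note that `n_neighbors=5` and `weights="uniform"` are already the scikit-learn defaults for `KNeighborsClassifier`; Table 3 appears to state them for completeness.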