Tab-Shapley: Identifying Top-k Tabular Data Quality Insights
Authors: Manisha Padala, Lokesh Nagalapatti, Atharv Tyagi, Ramasuri Narayanam, Shiv Kumar Saini
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate the effectiveness of our approach through empirical analysis on real-world tabular datasets with ground-truth anomaly labels. ... We evaluate the performance of the Tab-Shapley algorithm by comparing it with two baseline approaches: 1) DIFFI and 2) SHAP. Our experiments demonstrate that Tab-Shapley achieves more efficient ranking of attributes and rows compared to the baselines. |
| Researcher Affiliation | Collaboration | 1 IIT Gandhinagar, 2 IIT Bombay, 3 Adobe Research |
| Pseudocode | Yes | Algorithm 1: Tab-Shapley Algorithm ... Algorithm 2: Extract top-K data insights |
| Open Source Code | No | The paper does not contain any explicit statement about providing source code, nor does it include a link to a code repository. |
| Open Datasets | Yes | In our evaluation, we consider 12 real-world datasets (D1: Arrhythmia, D2: Ionosphere, D3: Letter, D4: SAT-IMAGE, D5: SPECT, D6: Speech, D7: PIMA, D8: Vertebral, D9: Optdigits, D10: WBC, D11: Wine Red, D12: Wine White) that provide ground truth labels for both record-level and attribute-level anomalies (Xu et al. 2021). ... Additionally, we also use two popular datasets that do not provide attribute-level ground truth information: (i) KDD Cup 1999 Dataset: We use the 10% version of the data obtained from the UCI Machine Learning Archive, following a similar pre-processing approach as in (Antwarg et al. 2021). (ii) Forest Cover Dataset. |
| Dataset Splits | No | The paper mentions using several datasets and references previous work for ground truth labels and pre-processing, but it does not explicitly state the train/test/validation splits used for its own experiments or how the data was partitioned for evaluation. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using TABNET for auto-encoder training, k-means algorithm for clustering, and Kadane's algorithm. However, it does not specify any version numbers for these or any other software libraries or dependencies. |
| Experiment Setup | Yes | During training, 50% of the features are randomly masked, and the TABNET predicts only the masked features. ... To determine the threshold on these errors for identifying anomalous records, we use clustering. ... We further scale the non-anomalous scores with a factor α > 0, that controls the number of non-anomalous cells that we can afford to have in each insight. Figure 2 shows a declining trend for the percentage of non-anomalous cells with an increase of α, which is as expected. ... The results are shown for α = 0.2; higher values of α would create smaller blocks. |
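The experiment-setup row describes flagging anomalous records by clustering the auto-encoder's reconstruction errors to pick a threshold. The paper does not spell out the exact procedure, so the following is a minimal sketch under an assumed design: a two-cluster 1-D k-means on the per-record errors, with the threshold placed midway between the two cluster centroids. The function name `anomaly_threshold` and the synthetic `errors` data are illustrative, not from the paper.

```python
import numpy as np

def anomaly_threshold(errors, n_iter=50):
    """Two-cluster 1-D k-means on reconstruction errors; records falling in
    the high-error cluster are treated as anomalous. A hedged sketch, not
    the authors' exact clustering procedure."""
    errors = np.asarray(errors, dtype=float)
    # Initialise the two centroids at the smallest and largest error.
    lo, hi = errors.min(), errors.max()
    for _ in range(n_iter):
        # Assign each record to its nearest centroid.
        in_low = np.abs(errors - lo) <= np.abs(errors - hi)
        new_lo, new_hi = errors[in_low].mean(), errors[~in_low].mean()
        if np.isclose(new_lo, lo) and np.isclose(new_hi, hi):
            break
        lo, hi = new_lo, new_hi
    # Threshold midway between the low-error and high-error centroids.
    return (lo + hi) / 2.0

# Synthetic errors: 95 well-reconstructed records, 5 poorly reconstructed.
errors = np.concatenate([np.random.default_rng(0).normal(0.1, 0.02, 95),
                         np.random.default_rng(1).normal(0.9, 0.05, 5)])
thr = anomaly_threshold(errors)
anomalous = errors > thr
```

With the synthetic data above, the threshold lands between the two error clusters and exactly the five high-error records are flagged.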
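The software-dependencies row notes that the paper uses Kadane's algorithm, and the setup row describes scaling non-anomalous cell scores by a penalty α so that each extracted insight tolerates only a limited number of non-anomalous cells. A plausible reading is that insights are maximum-score rectangular blocks of cells, found with the 2-D extension of Kadane's algorithm. The sketch below assumes that formulation (score +1 for anomalous cells, -α for non-anomalous ones); `max_sum_submatrix` is an illustrative name, not the paper's API.

```python
def max_sum_submatrix(scores):
    """2-D Kadane: find the contiguous block (rows x columns) with the
    largest total score. Returns (best_sum, (r1, r2, c1, c2)), inclusive."""
    n_rows, n_cols = len(scores), len(scores[0])
    best, best_box = float("-inf"), None
    for c1 in range(n_cols):
        col_sums = [0.0] * n_rows  # row sums over columns c1..c2
        for c2 in range(c1, n_cols):
            for r in range(n_rows):
                col_sums[r] += scores[r][c2]
            # 1-D Kadane over the accumulated per-row sums.
            cur, start = 0.0, 0
            for r in range(n_rows):
                if cur <= 0:
                    cur, start = col_sums[r], r
                else:
                    cur += col_sums[r]
                if cur > best:
                    best, best_box = cur, (start, r, c1, c2)
    return best, best_box

# Hypothetical cell scores: anomalous cells +1, non-anomalous -0.2 (alpha = 0.2).
scores = [[-0.2, 1.0, 1.0],
          [-0.2, 1.0, 1.0],
          [-0.2, -0.2, -0.2]]
best, best_box = max_sum_submatrix(scores)
```

On this toy matrix the best block is rows 0-1, columns 1-2 with score 4.0; raising α would shrink the blocks, matching the paper's observation that higher α creates smaller blocks.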