Tab-Shapley: Identifying Top-k Tabular Data Quality Insights
Authors: Manisha Padala, Lokesh Nagalapatti, Atharv Tyagi, Ramasuri Narayanam, Shiv Kumar Saini
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate the effectiveness of our approach through empirical analysis on real-world tabular datasets with ground-truth anomaly labels. ... We evaluate the performance of the Tab-Shapley algorithm by comparing it with two baseline approaches: 1) DIFFI and 2) SHAP. Our experiments demonstrate that Tab-Shapley achieves more efficient ranking of attributes and rows compared to the baselines. |
| Researcher Affiliation | Collaboration | 1 IIT Gandhinagar, 2 IIT Bombay, 3 Adobe Research |
| Pseudocode | Yes | Algorithm 1: Tab-Shapley Algorithm ... Algorithm 2: Extract top-K data insights |
| Open Source Code | No | The paper does not contain any explicit statement about providing source code, nor does it include a link to a code repository. |
| Open Datasets | Yes | In our evaluation, we consider 12 real-world datasets (D1: Arrhythmia, D2: Ionosphere, D3: Letter, D4: SAT-IMAGE, D5: SPECT, D6: Speech, D7: PIMA, D8: Vertebral, D9: Optdigits, D10: WBC, D11: Wine Red, D12: Wine White) that provide ground truth labels for both record-level and attribute-level anomalies (Xu et al. 2021). ... Additionally, we also use two popular datasets that do not provide attribute-level ground truth information: (i) KDD Cup 1999 Dataset: We use the 10% version of the data obtained from the UCI Machine Learning Archive, following a similar pre-processing approach as in (Antwarg et al. 2021). (ii) Forest Cover Dataset. |
| Dataset Splits | No | The paper mentions using several datasets and references previous work for ground truth labels and pre-processing, but it does not explicitly state the train/test/validation splits used for its own experiments or how the data was partitioned for evaluation. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using TABNET for auto-encoder training, k-means algorithm for clustering, and Kadane's algorithm. However, it does not specify any version numbers for these or any other software libraries or dependencies. |
| Experiment Setup | Yes | During training, 50% of the features are randomly masked, and the TABNET predicts only the masked features. ... To determine the threshold on these errors for identifying anomalous records, we use clustering. ... We further scale the non-anomalous scores with a factor α > 0, that controls the number of non-anomalous cells that we can afford to have in each insight. Figure 2 shows a declining trend for the percentage of non-anomalous cells with an increase of α, which is as expected. ... The results are shown for α = 0.2; higher values of α would create smaller blocks. |
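The experiment-setup row describes flagging anomalous records by clustering the auto-encoder's reconstruction errors to pick a threshold. The paper does not spell out the exact procedure, so the following is a minimal sketch under an assumed design: a two-cluster 1-D k-means on the per-record errors, with the threshold placed midway between the two cluster centroids. The function name `anomaly_threshold` and the synthetic `errors` data are illustrative, not from the paper.

```python
import numpy as np

def anomaly_threshold(errors, n_iter=50):
    """Two-cluster 1-D k-means on reconstruction errors; records falling in
    the high-error cluster are treated as anomalous. A hedged sketch, not
    the authors' exact clustering procedure."""
    errors = np.asarray(errors, dtype=float)
    # Initialise the two centroids at the smallest and largest error.
    lo, hi = errors.min(), errors.max()
    for _ in range(n_iter):
        # Assign each record to its nearest centroid.
        in_low = np.abs(errors - lo) <= np.abs(errors - hi)
        new_lo, new_hi = errors[in_low].mean(), errors[~in_low].mean()
        if np.isclose(new_lo, lo) and np.isclose(new_hi, hi):
            break
        lo, hi = new_lo, new_hi
    # Threshold midway between the low-error and high-error centroids.
    return (lo + hi) / 2.0

# Synthetic errors: 95 well-reconstructed records, 5 poorly reconstructed.
errors = np.concatenate([np.random.default_rng(0).normal(0.1, 0.02, 95),
                         np.random.default_rng(1).normal(0.9, 0.05, 5)])
thr = anomaly_threshold(errors)
anomalous = errors > thr
```

With the synthetic data above, the threshold lands between the two error clusters and exactly the five high-error records are flagged.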
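The software-dependencies row notes that the paper uses Kadane's algorithm, and the setup row describes scaling non-anomalous cell scores by a penalty α so that each extracted insight tolerates only a limited number of non-anomalous cells. A plausible reading is that insights are maximum-score rectangular blocks of cells, found with the 2-D extension of Kadane's algorithm. The sketch below assumes that formulation (score +1 for anomalous cells, -α for non-anomalous ones); `max_sum_submatrix` is an illustrative name, not the paper's API.

```python
def max_sum_submatrix(scores):
    """2-D Kadane: find the contiguous block (rows x columns) with the
    largest total score. Returns (best_sum, (r1, r2, c1, c2)), inclusive."""
    n_rows, n_cols = len(scores), len(scores[0])
    best, best_box = float("-inf"), None
    for c1 in range(n_cols):
        col_sums = [0.0] * n_rows  # row sums over columns c1..c2
        for c2 in range(c1, n_cols):
            for r in range(n_rows):
                col_sums[r] += scores[r][c2]
            # 1-D Kadane over the accumulated per-row sums.
            cur, start = 0.0, 0
            for r in range(n_rows):
                if cur <= 0:
                    cur, start = col_sums[r], r
                else:
                    cur += col_sums[r]
                if cur > best:
                    best, best_box = cur, (start, r, c1, c2)
    return best, best_box

# Hypothetical cell scores: anomalous cells +1, non-anomalous -0.2 (alpha = 0.2).
scores = [[-0.2, 1.0, 1.0],
          [-0.2, 1.0, 1.0],
          [-0.2, -0.2, -0.2]]
best, best_box = max_sum_submatrix(scores)
```

On this toy matrix the best block is rows 0-1, columns 1-2 with score 4.0; raising α would shrink the blocks, matching the paper's observation that higher α creates smaller blocks.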