TabCBM: Concept-based Interpretable Neural Networks for Tabular Data

Authors: Mateo Espinosa Zarlenga, Zohreh Shams, Michael Edward Nelson, Been Kim, Mateja Jamnik

TMLR 2023

Reproducibility checklist. Each entry lists the variable assessed, the result, and the supporting LLM response (quoted from the paper where applicable).
Research Type: Experimental
LLM Response: "We evaluate our method in both synthetic and real-world tabular tasks and show that TabCBM outperforms or performs competitively compared to state-of-the-art methods, while providing a high level of interpretability as measured by its ability to discover known high-level concepts."
Researcher Affiliation: Collaboration
LLM Response: Mateo Espinosa Zarlenga (EMAIL), Department of Computer Science and Technology, University of Cambridge; Zohreh Shams (EMAIL), Department of Computer Science and Technology, University of Cambridge; Michael Edward Nelson (EMAIL), Keyrock, European Bioinformatics Institute, University of Cambridge; Been Kim (EMAIL), Google DeepMind; Mateja Jamnik (EMAIL), Department of Computer Science and Technology, University of Cambridge
Pseudocode: No
LLM Response: The paper describes the model architecture and algorithm steps using mathematical notation and descriptive text, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code: Yes
LLM Response: "All of the code needed to reproduce our results, and use our model through a simple API, has been made public at https://github.com/mateoespinosa/tabcbm via an MIT license."
Open Datasets: Yes
LLM Response: "Datasets. We evaluate our method on both synthetic and real-world tabular datasets. We construct four synthetic tabular datasets of increasing complexity: Synth-Linear, Synth-Nonlin, Synth-Nonlin-Large, and Synth-scRNA. ... Finally, we use three real-world datasets with unknown ground-truth concepts: (1) PBMC (10x Genomics, 2016a;b) as a high-dimensional single-cell transcriptomic dataset, (2) Higgs (Aad et al., 2012) as a large real-world physics tabular dataset..., and (3) FICO (Fair Isaac Corporation, 2019) as a high-stakes financial task..."
Dataset Splits: Yes
LLM Response: "For each method, and across all tasks, we split each task's dataset into 80% training data and 20% test data and generate a validation set by randomly sampling without substitution 20% of the training data."
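The split described above (80% train / 20% test, with a validation set drawn without replacement as 20% of the training data) can be sketched as follows. This is an illustrative reconstruction, not the paper's actual code; the function name, seed, and use of numpy are assumptions.

```python
import numpy as np

def split_dataset(n_samples, seed=0):
    """Sketch of an 80/20 train/test split, plus a validation set
    sampled without replacement as 20% of the training indices.
    Illustrative only; not taken from the TabCBM repository."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)

    # 20% of all samples go to the test set.
    n_test = int(0.2 * n_samples)
    test_idx, train_idx = idx[:n_test], idx[n_test:]

    # Validation: 20% of the *training* data, sampled without replacement.
    val_idx = rng.choice(train_idx, size=int(0.2 * len(train_idx)),
                         replace=False)
    train_idx = np.setdiff1d(train_idx, val_idx)
    return train_idx, val_idx, test_idx
```

With 1,000 samples this yields 640 training, 160 validation, and 200 test indices, all mutually disjoint.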
Hardware Specification: No
LLM Response: "We choose a specific batch size to maximise GPU utilisation while remaining within our hardware's memory capabilities. We used an Adam optimiser (Kingma & Ba, 2014) with learning rate 10^-3, momentum 0.99, and standard hyperparameters β1 = 0.9 and β2 = 0.999, across all methods and tasks. ... With this aim in mind, we fix the architecture used across methods to be the same for a given dataset. We select architectures that are simple to train, yet large and expressive enough to perform well in each task of interest; with the constraint that they should train in our GPU cluster within reasonable times."
Software Dependencies: No
LLM Response: "We built our code base using a combination of TensorFlow (Abadi et al., 2016) and PyTorch (Paszke et al., 2019) and implemented TabCBM in TensorFlow. All of the code needed to reproduce our results, and use our model through a simple API, has been made public at https://github.com/mateoespinosa/tabcbm via an MIT license."
Experiment Setup: Yes
LLM Response: "For each method, and across all tasks, we split each task's dataset into 80% training data and 20% test data and generate a validation set by randomly sampling without substitution 20% of the training data. ... We used an Adam optimiser (Kingma & Ba, 2014) with learning rate 10^-3, momentum 0.99, and standard hyperparameters β1 = 0.9 and β2 = 0.999, across all methods and tasks. ... Hyperparameter values used for each dataset are reported in Table 5."
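The reported optimiser settings (Adam with learning rate 10^-3, β1 = 0.9, β2 = 0.999) correspond to the update rule below, shown here as a framework-free single-step sketch so the hyperparameters are concrete. The function name and the epsilon default are assumptions; the paper's code uses a framework optimiser rather than a hand-rolled one.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8):
    """One Adam update with the hyperparameters reported in the paper
    (lr = 1e-3, beta1 = 0.9, beta2 = 0.999); eps is an assumed default."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```

On the first step (t = 1) the bias corrections cancel the moment decay, so the parameter moves by approximately the full learning rate in the direction opposing the gradient.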