TabNAT: A Continuous-Discrete Joint Generative Framework for Tabular Data
Authors: Hengrui Zhang, Liancheng Fang, Qitian Wu, Philip S. Yu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on ten datasets with diverse properties demonstrate TabNAT's superiority in both unconditional tabular data generation and conditional missing data imputation tasks. ... We conduct comprehensive experiments on ten tabular datasets of various data types and scales to verify the efficacy of the proposed TabNAT. ... Our ablation studies further validate the effectiveness of each component in our proposed framework. |
| Researcher Affiliation | Academia | 1Computer Science Department, University of Illinois at Chicago, Chicago, United States 2Eric and Wendy Schmidt Center, Broad Institute of MIT and Harvard, Cambridge, United States. Correspondence to: Qitian Wu <EMAIL>, Philip S. Yu <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Loss for discrete columns ... Algorithm 2 Loss for continuous columns ... Algorithm 3 TabNAT: Training Process |
| Open Source Code | No | The paper does not explicitly state that the source code for the TabNAT methodology described in the paper is openly available, nor does it provide a direct link to a code repository for TabNAT. It mentions the codebase for baselines in footnote 17 but not for their own proposed method. |
| Open Datasets | Yes | The datasets used in this paper can be automatically downloaded using the script in the provided code. We use 10 tabular datasets from Kaggle or the UCI Machine Learning Repository: Adult, Default, Shoppers, Magic, Beijing, News, California, Letter, Car, and Nursery, which contain varying numbers of numerical and categorical features. |
| Dataset Splits | Yes | All datasets (except Adult) are split into training and testing sets with the ratio 9 : 1 with a fixed random seed. As Adult has its own official testing set, we use that directly as the testing set. For Machine Learning Efficiency (MLE) evaluation, the training set is further split into training and validation sets with the ratio 8 : 1. |
| Hardware Specification | Yes | We run our experiments on a single machine with an Intel i9-14900K CPU and an Nvidia RTX 4090 GPU with 24 GB of memory. |
| Software Dependencies | Yes | The code is written in Python 3.10.14 and we use PyTorch 2.2.2 on CUDA 12.2 to train the model on the GPU. |
| Experiment Setup | Yes | TabNAT uses a fixed set of hyperparameters for all datasets; Table 7 lists them. ... Optimizer: Adam; initial learning rate: 1e-3; weight decay: 1e-6; LR scheduler: ReduceLROnPlateau; training epochs: 5000; batch size: 1024. Transformers: 6 Transformer blocks; embedding dim d = 32; diffusion dimension d_diff = 512; 4 heads. |
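The split protocol quoted above (9 : 1 train/test with a fixed seed, then 8 : 1 train/validation within the training portion for MLE evaluation) can be sketched as follows. This is a minimal illustration, not the paper's code: the seed value 42 and the exact rounding of split sizes are assumptions, since the paper only states "a fixed random seed" and the ratios.

```python
import numpy as np

def split_dataset(n_rows: int, seed: int = 42):
    """Sketch of the reported split protocol: 9:1 train/test, then the
    training portion is split 8:1 into train/validation for MLE.
    The seed (42) and rounding choices are assumptions."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_rows)
    n_test = n_rows // 10                  # 9 : 1 train/test
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    n_val = len(train_idx) // 9            # 8 : 1 train/val within train
    val_idx, tr_idx = train_idx[:n_val], train_idx[n_val:]
    return tr_idx, val_idx, test_idx

tr, va, te = split_dataset(10000)
print(len(tr), len(va), len(te))  # → 8000 1000 1000
```

Note that Adult is the exception: the paper uses its official test set directly, so only the 8 : 1 train/validation split would apply there.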
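The optimizer portion of Table 7 maps directly onto PyTorch (the paper's stated framework, PyTorch 2.2.2). Below is a hedged configuration sketch: the `model` is a placeholder rather than TabNAT's architecture, and the `ReduceLROnPlateau` factor/patience are PyTorch defaults, since the paper does not report them.

```python
import torch

# Placeholder model; TabNAT's actual architecture (6 Transformer blocks,
# embedding dim 32, diffusion dim 512, 4 heads) is not reproduced here.
model = torch.nn.Linear(32, 32)

# From Table 7: Adam optimizer, initial LR 1e-3, weight decay 1e-6.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-6)

# From Table 7: ReduceLROnPlateau scheduler. Its factor and patience are
# not reported in the paper, so PyTorch defaults are assumed here.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)

# Training loop skeleton per Table 7: 5000 epochs, batch size 1024.
# for epoch in range(5000):
#     val_loss = ...  # compute validation loss for the epoch
#     scheduler.step(val_loss)
```

This is a config fragment only; the loss computation (Algorithms 1-3 in the paper) is the part a full reproduction would need to implement.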