TabNAT: A Continuous-Discrete Joint Generative Framework for Tabular Data
Authors: Hengrui Zhang, Liancheng Fang, Qitian Wu, Philip S. Yu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on ten datasets with diverse properties demonstrate TabNAT's superiority in both unconditional tabular data generation and conditional missing data imputation tasks. ... We conduct comprehensive experiments on ten tabular datasets of various data types and scales to verify the efficacy of the proposed TabNAT. ... Our ablation studies further validate the effectiveness of each component in our proposed framework. |
| Researcher Affiliation | Academia | 1Computer Science Department, University of Illinois at Chicago, Chicago, United States 2Eric and Wendy Schmidt Center, Broad Institute of MIT and Harvard, Cambridge, United States. Correspondence to: Qitian Wu <EMAIL>, Philip S. Yu <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Loss for discrete columns ... Algorithm 2 Loss for continuous columns ... Algorithm 3 TabNAT: Training Process |
| Open Source Code | No | The paper does not explicitly state that the source code for the TabNAT methodology described in the paper is openly available, nor does it provide a direct link to a code repository for TabNAT. It mentions the codebase for baselines in footnote 17 but not for their own proposed method. |
| Open Datasets | Yes | The datasets used in this paper can be automatically downloaded using the script in the provided code. We use 10 tabular datasets from Kaggle or the UCI Machine Learning Repository: Adult, Default, Shoppers, Magic, Beijing, News, California, Letter, Car, and Nursery, which contain varying numbers of numerical and categorical features. |
| Dataset Splits | Yes | All datasets (except Adult) are split into training and testing sets with the ratio 9 : 1 with a fixed random seed. As Adult has its own official testing set, we use that directly as the testing set. For Machine Learning Efficiency (MLE) evaluation, the training set is further split into training and validation sets with the ratio 8 : 1. |
| Hardware Specification | Yes | We run our experiments on a single machine with an Intel i9-14900K CPU and an Nvidia RTX 4090 GPU with 24 GB of memory. |
| Software Dependencies | Yes | The code is written in Python 3.10.14 and we use PyTorch 2.2.2 on CUDA 12.2 to train the model on the GPU. |
| Experiment Setup | Yes | TabNAT uses a fixed set of hyperparameters for all datasets; Table 7 lists them. ... Optimizer: Adam; initial learning rate: 1e-3; weight decay: 1e-6; LR scheduler: ReduceLROnPlateau; training epochs: 5000; batch size: 1024. Transformers: 6 Transformer blocks; embedding dim d = 32; diffusion dimension d_diff = 512; 4 heads. |
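The split protocol quoted above (9 : 1 train/test with a fixed seed, then 8 : 1 train/validation within the training portion for MLE evaluation) can be sketched as follows. This is a minimal illustration, not the paper's code: the seed value 42 and the exact rounding of split sizes are assumptions, since the paper only states "a fixed random seed" and the ratios.

```python
import numpy as np

def split_dataset(n_rows: int, seed: int = 42):
    """Sketch of the reported split protocol: 9:1 train/test, then the
    training portion is split 8:1 into train/validation for MLE.
    The seed (42) and rounding choices are assumptions."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_rows)
    n_test = n_rows // 10                  # 9 : 1 train/test
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    n_val = len(train_idx) // 9            # 8 : 1 train/val within train
    val_idx, tr_idx = train_idx[:n_val], train_idx[n_val:]
    return tr_idx, val_idx, test_idx

tr, va, te = split_dataset(10000)
print(len(tr), len(va), len(te))  # → 8000 1000 1000
```

Note that Adult is the exception: the paper uses its official test set directly, so only the 8 : 1 train/validation split would apply there.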
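The optimizer portion of Table 7 maps directly onto PyTorch (the paper's stated framework, PyTorch 2.2.2). Below is a hedged configuration sketch: the `model` is a placeholder rather than TabNAT's architecture, and the `ReduceLROnPlateau` factor/patience are PyTorch defaults, since the paper does not report them.

```python
import torch

# Placeholder model; TabNAT's actual architecture (6 Transformer blocks,
# embedding dim 32, diffusion dim 512, 4 heads) is not reproduced here.
model = torch.nn.Linear(32, 32)

# From Table 7: Adam optimizer, initial LR 1e-3, weight decay 1e-6.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-6)

# From Table 7: ReduceLROnPlateau scheduler. Its factor and patience are
# not reported in the paper, so PyTorch defaults are assumed here.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)

# Training loop skeleton per Table 7: 5000 epochs, batch size 1024.
# for epoch in range(5000):
#     val_loss = ...  # compute validation loss for the epoch
#     scheduler.step(val_loss)
```

This is a config fragment only; the loss computation (Algorithms 1-3 in the paper) is the part a full reproduction would need to implement.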