Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

STaSy: Score-based Tabular data Synthesis

Authors: Jayoung Kim, Chaejeong Lee, Noseong Park

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Furthermore, we also conduct rigorous experimental studies in terms of the generative task trilemma: sampling quality, diversity, and time. In our experiments with 15 benchmark tabular datasets and 7 baselines, our method outperforms existing methods in terms of task-dependant evaluations and diversity.
Researcher Affiliation | Academia | Jayoung Kim, Chaejeong Lee, and Noseong Park Department of Artificial Intelligence Yonsei University Seoul, South Korea EMAIL
Pseudocode | Yes | Algorithm 1 shows the overall training process for our STaSy.
Open Source Code | Yes | Source codes used in the experiments are available in the supplementary material. By following the README guidance, the main results are easily reproducible.
Open Datasets | Yes | The raw data of 15 datasets are available online: Credit: https://www.kaggle.com/mlg-ulb/creditcardfraud (DbCL 1.0) ... Spambase: https://archive.ics.uci.edu/ml/datasets/spambase (CC BY 4.0)
Dataset Splits | Yes | The train-test split ratio is 80% and 20%, respectively.
Hardware Specification | Yes | Our software and hardware environments are as follows: UBUNTU 18.04 LTS, PYTHON 3.8.2, PYTORCH 1.8.1, CUDA 11.4, and NVIDIA Driver 470.42.01, i9 CPU, and NVIDIA RTX 3090.
Software Dependencies | Yes | Our software and hardware environments are as follows: UBUNTU 18.04 LTS, PYTHON 3.8.2, PYTORCH 1.8.1, CUDA 11.4, and NVIDIA Driver 470.42.01, i9 CPU, and NVIDIA RTX 3090.
Experiment Setup | Yes | Hyperparameter settings for the best models are in Table 27. We have three SDE types, which are VE, VP, and sub-VP, and three layer types as shown in Appendix C: Concat, Squash, and Concatsquash. We use a learning rate in {2e-03, 2e-04}. We search for α0 and β0, in total, with 9 combinations using α0 = {0.20, 0.25, 0.30} and β0 = {0.80, 0.90, 0.95}.
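
The search space quoted in the Experiment Setup row can be enumerated in a few lines. The sketch below is illustrative only and is not taken from the authors' code: the names `alpha_beta_pairs` and `full_grid` and the dictionary keys are hypothetical, and the full Cartesian product over SDE type, layer type, and learning rate is an assumption, since the quoted text only confirms the 9 combinations of α0 and β0.

```python
from itertools import product

# Values quoted in the Experiment Setup row.
SDE_TYPES = ["VE", "VP", "sub-VP"]
LAYER_TYPES = ["Concat", "Squash", "Concatsquash"]
LEARNING_RATES = [2e-03, 2e-04]
ALPHA0 = [0.20, 0.25, 0.30]
BETA0 = [0.80, 0.90, 0.95]


def alpha_beta_pairs():
    """Return every (alpha0, beta0) pair: 3 x 3 = 9 combinations."""
    return list(product(ALPHA0, BETA0))


def full_grid():
    """Return one config dict per point of the full Cartesian product.

    Assumption: the summary does not say whether every SDE type, layer
    type, and learning rate was crossed with every (alpha0, beta0) pair.
    """
    return [
        {"sde": sde, "layer": layer, "lr": lr, "alpha0": a0, "beta0": b0}
        for sde, layer, lr, (a0, b0) in product(
            SDE_TYPES, LAYER_TYPES, LEARNING_RATES, alpha_beta_pairs()
        )
    ]


if __name__ == "__main__":
    print(len(alpha_beta_pairs()), "alpha0/beta0 combinations")
    print(len(full_grid()), "configurations if fully crossed")
```

Under the full-product assumption the grid has 3 × 3 × 2 × 9 configurations per dataset; the settings actually selected for the best models are the ones the paper reports in Table 27.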