CTSyn: A Foundation Model for Cross Tabular Data Generation

Authors: Xiaofeng Lin, Chenheng Xu, Matthew Yang, Guang Cheng

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through large-scale pre-training, CTSyn outperforms existing table synthesizers on standard benchmarks in both utility and diversity. These results position CTSyn as a promising framework for synthetic table generation and lay the groundwork for developing large-scale tabular foundation models.
Researcher Affiliation | Academia | Xiaofeng Lin1, Chenheng Xu1, Matthew Yang2, Guang Cheng1. 1Department of Statistics and Data Science, University of California, Los Angeles, CA, USA; 2Department of Computer Science, University of California, Los Angeles, CA, USA.
Pseudocode | No | The paper describes its methods in sections such as "3. METHODOLOGY", "3.1 FEATURE EMBEDDING", "3.2 AUTOENCODER FOR HETEROGENEOUS TABLES", and "3.3 CONDITIONAL DIFFUSION MODEL FOR LATENT VECTOR GENERATION" using paragraph text and mathematical formulations, but it does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper provides URLs for the implementations of baseline models in Section B "BASELINES IMPLEMENTATION", but it does not provide an explicit statement or a link to the source code for CTSyn itself.
Open Datasets | Yes | We use a filtered version of the OpenTab dataset (Ye et al., 2024) as our pretraining set. ... For downstream benchmarking, we evaluate eleven real-world datasets widely used in the tabular synthesis literature (Suh et al., 2023; Kotelnikov et al., 2023; Zhang et al., 2024) (see Table 1). ... Appendix C "BENCHMARK DATASETS" lists the URLs for the sources of each downstream benchmark set, including OpenML, UCI, and Kaggle datasets.
Dataset Splits | Yes | For each downstream dataset, we randomly split the data into a fine-tuning set (80%) and a held-out test set (20%). The fine-tuning set is then randomly shuffled, and few-shot subsets are created by selecting the first 30, 50, 100, 200, and 500 rows, respectively.
Hardware Specification | Yes | Training was completed on an Amazon AWS g5.12xlarge instance with 192 GB of system memory and 4 Nvidia A10G GPUs, each with 24 GB of GPU memory.
Software Dependencies | No | The paper mentions using the "AdamW optimizer" and "GTE-large (Li et al., 2023a) as our text embedding model," but it does not provide specific version numbers for these or other key software components used in their implementation.
Experiment Setup | Yes | Implementation: We use four cross-attention layers for both the encoder and decoder, with ℓ = 16 latent dimensions and M_agg = 64. All VAE models in this paper are trained using the AdamW optimizer with an initial learning rate of 0.0002. The learning rate is multiplied by 0.95 if the validation loss does not improve for 10 consecutive epochs. ... We use a β-VAE setup, starting with β_max = 10^2 and gradually decreasing β by multiplying it by 0.7 when the reconstruction loss does not improve for 5 consecutive epochs, until reaching a minimum value of 10^-5. ... Following the specifications in Lovelace et al. (2024), our diffusion model uses a pre-LayerNorm transformer architecture with 12 layers, a hidden dimension of 768, learnable absolute positional encodings, and a GeGLU activation function. The noise level is conditioned via a sinusoidal time embedding, which is processed by an MLP and added to the input sequence. Adaptive layer normalization is applied to each feedforward layer, conditioned on the time embedding. We use the AdamW optimizer with a learning rate of 0.0001, a cosine annealing scheduler, a batch size of 256, and 250 sampling steps. ... For CTSyn, we pre-train the autoencoder for 300 epochs and the diffusion model for 200,000 steps. For fine-tuning, we train the conditional diffusion model and the decoder network of the autoencoder while freezing the encoder to maintain alignment in the latent space. We fine-tune the decoder for 100 epochs and the diffusion model for 10,000 steps.
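The plateau-triggered multiplicative schedules quoted in the Experiment Setup row (learning rate ×0.95 after 10 stagnant epochs; β ×0.7 after 5 stagnant epochs, floored at a minimum value) can be sketched with a single helper. This is our illustrative reading of the paper's description, not the authors' code; the class name and interface are ours.

```python
class PlateauMultiplier:
    """Multiply a tracked value by `factor` whenever the monitored loss
    fails to improve for `patience` consecutive steps, respecting `floor`.

    Illustrative sketch of the schedules described in the paper:
      - learning rate: factor 0.95, patience 10 (validation loss)
      - beta (KL weight): factor 0.7, patience 5, floor 1e-5 (recon loss)
    """

    def __init__(self, value, factor, patience, floor=0.0):
        self.value = value
        self.factor = factor
        self.patience = patience
        self.floor = floor
        self.best = float("inf")   # best loss seen so far
        self.bad_steps = 0         # consecutive non-improving steps

    def step(self, loss):
        if loss < self.best:
            self.best = loss
            self.bad_steps = 0
        else:
            self.bad_steps += 1
            if self.bad_steps >= self.patience:
                self.value = max(self.value * self.factor, self.floor)
                self.bad_steps = 0
        return self.value


# Hypothetical instantiations matching the quoted hyperparameters
lr_sched = PlateauMultiplier(value=2e-4, factor=0.95, patience=10)
beta_sched = PlateauMultiplier(value=1e2, factor=0.7, patience=5, floor=1e-5)
```

In a real training loop, `step()` would be called once per epoch with the relevant validation or reconstruction loss, and the returned value fed back into the optimizer or the VAE loss.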
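The few-shot protocol described in the Dataset Splits row (80/20 split, shuffle the fine-tuning set, take the first k rows as few-shot subsets) can be sketched as follows. This is our reading of the description, not the authors' released code; the function name and defaults are illustrative.

```python
import random


def make_splits(n_rows, shot_sizes=(30, 50, 100, 200, 500), seed=0):
    """Sketch of the quoted split protocol: 80% fine-tuning / 20% test,
    then few-shot subsets taken as the first k rows of the shuffled
    fine-tuning set. Returns row indices, not the rows themselves."""
    rng = random.Random(seed)

    # Random 80/20 split of row indices.
    idx = list(range(n_rows))
    rng.shuffle(idx)
    n_ft = int(0.8 * n_rows)
    ft_idx, test_idx = idx[:n_ft], idx[n_ft:]

    # Shuffle the fine-tuning set, then slice nested few-shot subsets.
    rng.shuffle(ft_idx)
    few_shot = {k: ft_idx[:k] for k in shot_sizes if k <= len(ft_idx)}
    return ft_idx, test_idx, few_shot
```

Because every few-shot subset is a prefix of the same shuffled list, the 30-row subset is contained in the 50-row subset, and so on, which keeps the few-shot conditions nested and comparable.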