Masked Language Modeling Becomes Conditional Density Estimation for Tabular Data Synthesis

Authors: Seunghwan An, Gyeongdong Woo, Jaesung Lim, ChangHyun Kim, Sungchul Hong, Jong-June Jeon

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental To validate our proposed model, we evaluate its performance in synthetic data generation across 10 real-world datasets, demonstrating its ability to adjust data privacy levels easily without retraining. Additionally, since masked input tokens in MLM are analogous to missing data, we further assess its effectiveness in handling training datasets with missing values, including multiple imputations of the missing entries. ... We conduct experiments in which we can provide answers to the following three experimental questions: Q1. Does MaCoDE achieve state-of-the-art performance in synthetic data generation? Q2. Can MaCoDE generate high-quality synthetic data even when faced with missing data scenarios? Q3. Is MaCoDE capable of supporting multiple imputations for deriving statistically valid inferences from missing data?
Researcher Affiliation Academia (1) Department of Statistical Data Science, University of Seoul, S. Korea; (2) Department of Statistics, University of Seoul, S. Korea; (3) Department of Statistics, Changwon National University, S. Korea. EMAIL, EMAIL, EMAIL
Pseudocode Yes Algorithm 1: Synthetic data generation
Initialize: For all j, ŷ_j ← 0 and m̂_j ← 0.
Output: A synthetic sample x̂.
1: for j ∈ randperm{1, 2, ..., p} do
2:   ŷ_j ∼ Cat(π_j(ŷ, m̂; θ)/τ)
3:   m̂_j ← 1
4: for j = 1, 2, ..., p do
5:   if j ∈ I_C then
6:     u ∼ U(b_{ŷ_j − 1}, b_{ŷ_j})
7:     x̂_j ← F̂_j⁻¹(u)
8:   if j ∈ I_D then
9:     x̂_j ← ŷ_j
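The two-stage procedure in Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `predict_logits`, `inv_cdfs`, and `bin_edges` are hypothetical stand-ins for the trained MLM classifier π_j, the estimated per-column marginal CDF inverses F̂_j⁻¹, and the bin boundaries b, and τ is assumed to enter as a softmax temperature.

```python
import numpy as np

def generate_sample(predict_logits, inv_cdfs, bin_edges, cont_cols, p,
                    tau=1.0, rng=None):
    """Sketch of Algorithm 1. `predict_logits(y, m, j)` stands in for the
    trained masked-language-model classifier producing logits for column j
    given the partially unmasked labels y and mask indicators m."""
    rng = np.random.default_rng() if rng is None else rng
    y_hat = np.zeros(p, dtype=int)  # class labels; 0 means "still masked"
    m_hat = np.zeros(p, dtype=int)  # mask indicators
    # Stage 1 (lines 1-3): unmask the p columns one by one in random order.
    for j in rng.permutation(p):
        z = predict_logits(y_hat, m_hat, j) / tau  # temperature-scaled logits
        probs = np.exp(z - z.max())                # stable softmax
        probs /= probs.sum()
        y_hat[j] = rng.choice(len(probs), p=probs) + 1  # labels run 1..K
        m_hat[j] = 1
    # Stage 2 (lines 4-9): map class labels back to the data scale.
    x_hat = np.empty(p)
    for j in range(p):
        if j in cont_cols:
            # Continuous column: draw u uniformly inside the sampled bin,
            # then invert the estimated marginal CDF of column j.
            lo, hi = bin_edges[j][y_hat[j] - 1], bin_edges[j][y_hat[j]]
            u = rng.uniform(lo, hi)
            x_hat[j] = inv_cdfs[j](u)
        else:
            # Discrete column: the sampled label is the value itself.
            x_hat[j] = y_hat[j]
    return x_hat
```

Note the order of the two loops: all columns are first unmasked autoregressively in a random permutation, and only afterwards are the categorical bin labels decoded back to continuous values.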
Open Source Code Yes Code https://github.com/an-seunghwan/MaCoDE
Open Datasets Yes we utilize 10 publicly available real tabular UCI and Kaggle(1) datasets of varying sizes and numbers of columns. Detailed statistics of these datasets are provided in the Appendix. ... (1) https://archive.ics.uci.edu/, https://www.kaggle.com/datasets/
Dataset Splits Yes For each random seed, the dataset is randomly split into training and testing sets with an 80% training and 20% testing ratio in the evaluation of Q1 and Q2.
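The per-seed 80/20 split described above can be reproduced with a few lines of NumPy. This is a sketch of the stated protocol, not the authors' code; the function name and signature are hypothetical.

```python
import numpy as np

def split_80_20(n_rows, seed):
    """Randomly partition row indices into 80% train / 20% test,
    deterministically per random seed (sketch of the stated protocol)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_rows)
    cut = int(0.8 * n_rows)
    return idx[:cut], idx[cut:]
```

Seeding the generator per split makes each of the repeated evaluation runs reproducible while still varying the partition across seeds.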
Hardware Specification Yes We run experiments using an NVIDIA A10 GPU.
Software Dependencies No The paper mentions various baseline models (CTGAN, TVAE, TabDDPM, etc.) and refers to 'Detailed hyperparameter settings are provided in the Appendix.' However, it does not provide specific version numbers for software libraries (e.g., Python, PyTorch) or other dependencies used to implement and run the proposed method.
Experiment Setup Yes For MaCoDE, we set L = 50 and τ = 1 for all datasets. Detailed hyperparameter settings are provided in the Appendix.
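The temperature τ that appears in Algorithm 1 (line 2, Cat(π/τ)) is the knob the abstract credits with adjusting privacy levels without retraining. Assuming τ acts as a standard softmax temperature, a raised τ flattens the per-column categorical distribution, trading fidelity for more diffuse (more private) samples. A minimal sketch of that reading:

```python
import numpy as np

def temperature_probs(logits, tau):
    """Softmax with temperature tau (sketch, assuming the standard reading
    of Cat(pi/tau) in Algorithm 1). Larger tau flattens the distribution;
    tau = 1 recovers the plain softmax used in the paper's default setup."""
    z = np.asarray(logits, dtype=float) / tau
    z -= z.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()
```

Because τ only rescales the classifier's output at sampling time, a single trained model can emit samples at different privacy/fidelity trade-offs without retraining.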